Exec. Summary

Row

Executive Summary

This project examined two questions:

  1. What factors influence the price per night of an Airbnb property in Denver, Colorado?
  2. What factors influence a property receiving a perfect score for value?

Data

The dataset comes from InsideAirbnb.com. It was scraped on December 30 and 31, 2022, and covers the previous 12 months of Airbnb listings for Denver, CO.

The original dataset has 3050 rows and 32 variables. Because the variables differ vastly in scale (some had values in the thousands; others had a maximum value of 5), all of them were normalized. After dummy coding the categorical variables, the normalized predictor set contained 105 variables.
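
The normalization step can be illustrated with a small sketch. The original analysis was done in R; this hypothetical Python version shows the usual z-score scaling (subtract the mean, divide by the standard deviation) applied to one column:

```python
def z_score(values):
    """Scale a numeric column to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# a toy "price" column with values in the hundreds
scaled = z_score([89.0, 125.0, 189.0])
```

After scaling, a variable measured in the thousands and a rating capped at 5 live on the same scale, so no single variable dominates the penalized models by virtue of its units.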

Methodology

Analyses were run to predict two variables: price (continuous) and value_score_cat (categorical). value_score_cat has two levels: a perfect score of 5.0 was coded as “excellent”, and anything else was coded as “other”.

The following methodology was undertaken for each of the variables:

  • Examine the distribution of the variables to look for relationships
  • Conduct simple regression analyses for baselines
  • Normalize all variables
  • Employ lasso regression to narrow down the number of variables
  • Build one random forest model with all predictors and a second one with only the lasso predictors

Column

Conclusions

Price

Number of baths turned out to be the most important predictor of price, by a wide margin. Number of baths is a good proxy for property size: more baths suggest a larger property, and a higher price for a larger property makes sense.

Looking at the analyses as a whole, the three most important factors that a host could change are minimum nights, review scores rating, and maximum nights. We see from the regression coefficients that having a lower number of minimum nights, a higher number of maximum nights, and a higher review scores rating all increase price. The recommendations for a host would be to be flexible (allow for both shorter and longer stays) and to work hard to please the customers.

Value Score

Review scores rating is the most important predictor of value score.

Interestingly, review scores rating is negatively correlated with value score: properties rated higher overall tend to be rated as having lower value for the price. The other two significant predictors from the lasso regression are the number of reviews in the last 12 months and the number of months since the property’s first review. Both are positively correlated with value score, meaning that properties that have been listed on Airbnb longer and have had more customers recently receive better ratings for value.

The takeaway from the negative correlation between review scores rating and value score may be that, even though customers have a great overall experience at an Airbnb, they may believe that it was overpriced.

More research should be done before offering Airbnb hosts advice on this matter, however.

Introduction

Row

The Project

The Problem Description

The dataset comes from InsideAirbnb.com (http://insideairbnb.com/get-the-data). It was scraped on December 30 and 31, 2022, and covers the previous 12 months of Airbnb listings for Denver, CO. Variables from the downloaded detailed dataset were transformed in R in order to conduct the analyses.

The distribution of the variables will be examined first in order to look for relationships. Regression analysis will be performed to predict a property’s per-night price. Following normalization of the variables, lasso regression will be employed to narrow down the number of variables. One random forest model will be run with all predictors and a second one with the lasso predictors.

A classification analysis will be used to predict a customer’s rating of the value of the property (whether the property is a good value for the price paid). Because the ratings skew high, two categories were assigned: ‘excellent’ for a perfect rating and ‘other’ for anything less. Lasso regression will again be used to narrow down the number of predictors, followed by logistic regression. One random forest model will be run with all predictors and a second one with the lasso predictors.

Finally, a summary of conclusions will be presented.

The Data

This dataset has 3050 rows and 32 variables.

Data Sources

The dataset comes from InsideAirbnb.com (http://insideairbnb.com/get-the-data). It was scraped on December 30 and 31, 2022, and covers the previous 12 months of Airbnb listings for Denver, CO.

The Data

VARIABLES TO PREDICT WITH

  • host_in_denver: whether or not the host lives in Denver
  • host_response_time: how quickly a host responds to messages
  • host_response_rate: the percentage of requests to which a host has responded
  • host_acceptance_rate: the percentage of booking requests that a host accepts
  • host_is_superhost: whether a host has earned the Airbnb distinction “Superhost”
  • neighborhood: neighborhood name of the property’s location
  • latitude: latitude of property listing
  • longitude: longitude of property listing
  • room_type: whether the entire home/apartment is being rented or if it is a private room
  • max_guests: maximum number of guests
  • bedrooms: number of bedrooms
  • beds: total number of beds
  • price: per-night price (in US$) - this is a predictor for the classification analysis
  • min_nights: minimum number of consecutive nights that the property may be rented
  • max_nights: maximum number of consecutive nights that the property may be rented
  • number_of_reviews: total number of reviews that a listing has
  • number_of_reviews_ltm: number of reviews that a listing has had in the last twelve months
  • review_scores_rating: overall average rating for the property
  • review_scores_accuracy: average rating for the accuracy of the property’s description
  • review_scores_cleanliness: average rating for the cleanliness of the property
  • review_scores_checkin: average rating for the ease of check-in
  • review_scores_communication: average rating for the host’s communication
  • review_scores_location: average rating for the property’s location
  • review_scores_value: average rating for the renters’ assessments of value of the rental experience for the price
  • calculated_host_listings_count: number of listings that a host has
  • reviews_per_month: number of total reviews / the number of months that the property has been listed
  • host_tenure: number of months that the host has been an Airbnb member
  • since_first_review: number of months since the first review
  • since_last_review: number of months since the most recent review
  • num_bath: number of bathrooms available to guests
  • bath_type: whether the bathroom(s) are shared or private

VARIABLES WE WANT TO PREDICT

  • price: per-night price (in US$)
  • value_score_cat: derived from review_scores_value – if review_scores_value is 5.0, the property is rated ‘excellent’; otherwise it is rated ‘other’
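
The derivation of value_score_cat is a simple threshold recode. A minimal sketch (Python here; the original transformation was done in R):

```python
def value_score_cat(review_scores_value):
    """Perfect 5.0 value score -> 'excellent'; anything lower -> 'other'."""
    return "excellent" if review_scores_value == 5.0 else "other"

labels = [value_score_cat(s) for s in (5.0, 4.94, 4.73)]
```

Because the ratings skew high, even this strict cutoff still leaves roughly 20% of listings in the ‘excellent’ class.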

Explorations

Column

View the Data Summaries

Here are the ranges of values for each variable.
 host_in_denver          host_response_time host_response_rate
 no : 384       a few days or more:   2     Min.   : 17.00    
 yes:2666       within a day      : 147     1st Qu.:100.00    
                within a few hours: 287     Median :100.00    
                within an hour    :2614     Mean   : 98.92    
                                            3rd Qu.:100.00    
                                            Max.   :100.00    
                                                              
 host_acceptance_rate host_is_superhost        neighborhood     latitude    
 Min.   : 20.00       no :1047          Five Points  : 258   Min.   :39.63  
 1st Qu.: 94.00       yes:2003          Highland     : 238   1st Qu.:39.73  
 Median : 99.00                         West Colfax  : 139   Median :39.75  
 Mean   : 94.27                         West Highland: 117   Mean   :39.74  
 3rd Qu.:100.00                         Union Station: 108   3rd Qu.:39.76  
 Max.   :100.00                         Berkeley     : 105   Max.   :39.82  
                                        (Other)      :2085                  
   longitude                room_type      max_guests        bedrooms    
 Min.   :-105.1   Entire home/apt:2581   Min.   : 1.000   Min.   :1.000  
 1st Qu.:-105.0   Hotel room     :   7   1st Qu.: 2.000   1st Qu.:1.000  
 Median :-105.0   Private room   : 442   Median : 4.000   Median :2.000  
 Mean   :-105.0   Shared room    :  20   Mean   : 4.265   Mean   :1.818  
 3rd Qu.:-105.0                          3rd Qu.: 6.000   3rd Qu.:2.000  
 Max.   :-104.7                          Max.   :16.000   Max.   :9.000  
                                                                         
      beds            price          min_nights       max_nights     
 Min.   : 1.000   Min.   :  15.0   Min.   :  1.00   Min.   :    1.0  
 1st Qu.: 1.000   1st Qu.:  89.0   1st Qu.:  2.00   1st Qu.:  365.0  
 Median : 2.000   Median : 125.0   Median :  2.90   Median : 1125.0  
 Mean   : 2.292   Mean   : 167.8   Mean   : 18.48   Mean   :  866.9  
 3rd Qu.: 3.000   3rd Qu.: 189.0   3rd Qu.: 29.00   3rd Qu.: 1125.0  
 Max.   :14.000   Max.   :2614.0   Max.   :365.00   Max.   :10000.0  
                                                                     
 number_of_reviews number_of_reviews_ltm review_scores_rating
 Min.   :   1.00   Min.   :  0.00        Min.   :1.000       
 1st Qu.:   6.00   1st Qu.:  2.00        1st Qu.:4.820       
 Median :  28.00   Median : 10.00        Median :4.930       
 Mean   :  68.88   Mean   : 20.13        Mean   :4.852       
 3rd Qu.:  86.00   3rd Qu.: 31.00        3rd Qu.:5.000       
 Max.   :1338.00   Max.   :400.00        Max.   :5.000       
                                                             
 review_scores_accuracy review_scores_cleanliness review_scores_checkin
 Min.   :1.00           Min.   :1.00              Min.   :1.000        
 1st Qu.:4.87           1st Qu.:4.82              1st Qu.:4.920        
 Median :4.95           Median :4.94              Median :4.980        
 Mean   :4.88           Mean   :4.85              Mean   :4.917        
 3rd Qu.:5.00           3rd Qu.:5.00              3rd Qu.:5.000        
 Max.   :5.00           Max.   :5.00              Max.   :5.000        
                                                                       
 review_scores_communication review_scores_location review_scores_value
 Min.   :1.000               Min.   :1.000          Min.   :1.000      
 1st Qu.:4.920               1st Qu.:4.840          1st Qu.:4.730      
 Median :4.980               Median :4.930          Median :4.850      
 Mean   :4.905               Mean   :4.866          Mean   :4.778      
 3rd Qu.:5.000               3rd Qu.:5.000          3rd Qu.:4.940      
 Max.   :5.000               Max.   :5.000          Max.   :5.000      
                                                                       
 calculated_host_listings_count reviews_per_month  host_tenure     
 Min.   :  1.0                  Min.   : 0.020    Min.   :  1.324  
 1st Qu.:  1.0                  1st Qu.: 0.510    1st Qu.: 69.077  
 Median :  1.0                  Median : 1.590    Median : 84.511  
 Mean   : 11.8                  Mean   : 2.187    Mean   : 82.707  
 3rd Qu.:  4.0                  3rd Qu.: 3.230    3rd Qu.:102.482  
 Max.   :243.0                  Max.   :40.000    Max.   :173.808  
                                                                   
 since_first_review  since_last_review      num_bath       bath_type   
 Min.   :  0.04244   Min.   :  0.04244   Min.   :0.500   private:2855  
 1st Qu.:  7.56605   1st Qu.:  0.56810   1st Qu.:1.000   shared : 195  
 Median : 19.39357   Median :  1.29090   Median :1.000                 
 Mean   : 30.36135   Mean   :  3.27590   Mean   :1.499                 
 3rd Qu.: 48.32991   3rd Qu.:  3.03217   3rd Qu.:2.000                 
 Max.   :166.28474   Max.   :132.60917   Max.   :9.500                 
                                                                       
  value_score_cat
 excellent: 603  
 other    :2447  
                 
                 
                 
                 
                 

Column

Average Price by max_guests (Maximum number of guests at one time)

max_guests n mean(price)
1 106 51.85
2 918 104.55
3 231 117.15
4 823 142.26
5 162 169.94
6 398 228.21
7 56 293.89
8 154 292.26
9 21 278.29
10 89 339.85
11 13 413.15
12 44 502.73
13 7 532.00
14 12 686.58
15 5 363.60
16 11 825.82

Visualizations

Response Variables’ Relationships with Predictors

  • Unsurprisingly, we see that the price values are very right-skewed. Although the median price is $125, the maximum price is $2614. Of the continuous predictors, max_guests, beds, bedrooms, and num_bath have the highest correlations with price. That would be logical, since a property that can accommodate more guests would likely command a higher price.

  • We see that the ‘excellent’ value score category makes up approximately 20% of the total. One interesting finding is that non-superhosts outnumber superhosts in the ‘excellent’ value score category, even though superhosts make up roughly two-thirds of all hosts.

  • After finding collinearity among beds, bedrooms, num_bath, and max_guests, individual regressions were run between price and each of the four collinear variables in order to select the one with the largest coefficient. num_bath, having the highest coefficient (127.8), was retained in the dataset; the other three variables were eliminated.
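
That selection step can be sketched as follows: fit a simple regression of price on each candidate proxy and keep the one with the largest slope. This hypothetical Python version mirrors the idea (the original analysis was done in R):

```python
def slope(x, y):
    """Least-squares slope of y ~ x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def pick_proxy(candidates, price):
    """candidates: {name: column of values}; keep the largest coefficient."""
    return max(candidates, key=lambda name: slope(candidates[name], price))
```

On toy data where one candidate has twice the slope of the other, pick_proxy returns the steeper one, just as num_bath (coefficient 127.8) won out over max_guests, bedrooms, and beds.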

Row

Value Score

Price

Row

Price vs Continuous Variables #1

Price vs Continuous Variables #2

Price vs Continuous Variables #3

Price vs Continuous Variables #4

Price vs Continuous Variables #5

Checking Correlations Between Property Size Proxy Variables

Checking Regressions Individually Between Price and Property Size Proxy Variables

term estimate std.error statistic p.value
(Intercept) 14.411 4.799 3.003 0.003
max_guests 35.970 0.961 37.411 0.000

term estimate std.error statistic p.value
(Intercept) -23.688 5.005 -4.733 0
num_bath 127.807 2.943 43.430 0

term estimate std.error statistic p.value
(Intercept) -5.308 5.036 -1.054 0.292
bedrooms 95.219 2.419 39.362 0.000

term estimate std.error statistic p.value
(Intercept) 38.590 4.586 8.414 0
beds 56.378 1.659 33.988 0

Value Score vs Host Is Superhost

Initial Models

Row

Baseline Models

Linear and logistic regression models were run as baseline models. All predictors were used with these initial models.

We can see from the actual-vs-predicted plot for the linear regression that a number of price outliers are likely causing problems with predictions. Given the large number of outliers, the R-square value of .448 is not surprising.

In contrast, the logistic regression seems to be fairly accurate overall, with accuracy of .912 and an AUC of .96. It is achieving this accuracy with a high specificity of .946 and a rather lower sensitivity of .776.


Row

Predicting Price

Here is the initial regression model predicting price using all predictors.

term estimate std.error statistic p.value
(Intercept) 60755.074 41260.513 1.472 0.141
host_in_denveryes 13.445 7.812 1.721 0.085
host_response_timewithin a day 100.638 99.732 1.009 0.313
host_response_timewithin a few hours 99.962 100.740 0.992 0.321
host_response_timewithin an hour 113.061 100.939 1.120 0.263
host_response_rate -0.850 0.562 -1.512 0.131
host_acceptance_rate -0.126 0.235 -0.537 0.592
host_is_superhostyes -13.343 5.532 -2.412 0.016
neighborhoodAuraria -129.488 79.650 -1.626 0.104
neighborhoodBaker -26.446 28.664 -0.923 0.356
neighborhoodBarnum 17.528 39.377 0.445 0.656
neighborhoodBarnum West 3.667 39.808 0.092 0.927
neighborhoodBear Valley 11.676 80.857 0.144 0.885
neighborhoodBelcaro 364.047 64.303 5.661 0.000
neighborhoodBerkeley -22.361 44.717 -0.500 0.617
neighborhoodCapitol Hill -21.097 33.117 -0.637 0.524
neighborhoodCBD 32.263 34.807 0.927 0.354
neighborhoodChaffee Park -15.118 53.273 -0.284 0.777
neighborhoodCheesman Park -35.447 36.379 -0.974 0.330
neighborhoodCherry Creek -105.990 44.180 -2.399 0.016
neighborhoodCity Park -54.655 44.770 -1.221 0.222
neighborhoodCity Park West -80.398 37.821 -2.126 0.034
neighborhoodCivic Center -131.092 43.657 -3.003 0.003
neighborhoodClayton -84.017 48.443 -1.734 0.083
neighborhoodCole 24.432 44.486 0.549 0.583
neighborhoodCollege View - South Platte 0.436 63.453 0.007 0.995
neighborhoodCongress Park -65.791 38.852 -1.693 0.090
neighborhoodCory - Merrill -62.961 51.427 -1.224 0.221
neighborhoodCountry Club -39.348 51.111 -0.770 0.441
neighborhoodDIA -265.768 113.648 -2.339 0.019
neighborhoodEast Colfax -137.024 57.979 -2.363 0.018
neighborhoodElyria Swansea -80.791 62.249 -1.298 0.194
neighborhoodFive Points -34.935 37.422 -0.934 0.351
neighborhoodFort Logan 7.566 74.343 0.102 0.919
neighborhoodGateway - Green Valley Ranch -265.493 100.014 -2.655 0.008
neighborhoodGlobeville -55.155 60.450 -0.912 0.362
neighborhoodGoldsmith -44.727 58.000 -0.771 0.441
neighborhoodHale -95.195 46.082 -2.066 0.039
neighborhoodHampden -93.596 64.408 -1.453 0.146
neighborhoodHampden South -123.880 62.397 -1.985 0.047
neighborhoodHarvey Park 3.650 56.115 0.065 0.948
neighborhoodHarvey Park South 19.938 65.228 0.306 0.760
neighborhoodHighland -4.726 37.880 -0.125 0.901
neighborhoodHilltop -89.985 59.848 -1.504 0.133
neighborhoodIndian Creek -251.374 103.117 -2.438 0.015
neighborhoodJefferson Park -38.646 36.428 -1.061 0.289
neighborhoodLincoln Park -40.319 32.741 -1.231 0.218
neighborhoodLowry Field -178.195 70.397 -2.531 0.011
neighborhoodMar Lee -27.764 51.665 -0.537 0.591
neighborhoodMarston -31.277 134.454 -0.233 0.816
neighborhoodMontbello -184.683 88.385 -2.090 0.037
neighborhoodMontclair -84.039 54.463 -1.543 0.123
neighborhoodNorth Capitol Hill -36.051 39.319 -0.917 0.359
neighborhoodNorth Park Hill -104.733 50.581 -2.071 0.038
neighborhoodNortheast Park Hill -110.563 54.101 -2.044 0.041
neighborhoodOverland -43.115 38.319 -1.125 0.261
neighborhoodPlatt Park -35.478 32.891 -1.079 0.281
neighborhoodRegis -18.087 53.135 -0.340 0.734
neighborhoodRosedale -37.973 39.872 -0.952 0.341
neighborhoodRuby Hill -8.124 36.318 -0.224 0.823
neighborhoodSkyland -62.148 45.105 -1.378 0.168
neighborhoodSloan Lake 13.762 36.528 0.377 0.706
neighborhoodSouth Park Hill -81.466 49.597 -1.643 0.101
neighborhoodSpeer -38.445 29.333 -1.311 0.190
neighborhoodStapleton -181.470 65.552 -2.768 0.006
neighborhoodSun Valley -20.041 129.607 -0.155 0.877
neighborhoodSunnyside -43.260 43.492 -0.995 0.320
neighborhoodUnion Station -5.792 36.135 -0.160 0.873
neighborhoodUniversity -13.795 38.585 -0.358 0.721
neighborhoodUniversity Hills -86.977 52.684 -1.651 0.099
neighborhoodUniversity Park 64.361 43.125 1.492 0.136
neighborhoodValverde -17.472 42.216 -0.414 0.679
neighborhoodVilla Park -18.166 34.143 -0.532 0.595
neighborhoodVirginia Village -80.757 44.736 -1.805 0.071
neighborhoodWashington Park -36.897 38.428 -0.960 0.337
neighborhoodWashington Park West -19.210 31.855 -0.603 0.547
neighborhoodWashington Virginia Vale -121.983 46.910 -2.600 0.009
neighborhoodWellshire -79.543 67.590 -1.177 0.239
neighborhoodWest Colfax -16.450 32.040 -0.513 0.608
neighborhoodWest Highland 14.683 40.075 0.366 0.714
neighborhoodWestwood 8.254 37.848 0.218 0.827
neighborhoodWhittier -44.132 41.809 -1.056 0.291
neighborhoodWindsor -131.054 76.982 -1.702 0.089
latitude 350.855 494.402 0.710 0.478
longitude 712.799 348.336 2.046 0.041
room_typeHotel room -97.304 50.425 -1.930 0.054
room_typePrivate room -21.999 8.608 -2.556 0.011
room_typeShared room -94.165 32.582 -2.890 0.004
min_nights -0.196 0.102 -1.924 0.054
max_nights 0.008 0.005 1.511 0.131
number_of_reviews -0.107 0.042 -2.539 0.011
number_of_reviews_ltm 0.154 0.191 0.809 0.419
review_scores_rating 53.688 18.974 2.830 0.005
review_scores_accuracy 14.818 15.703 0.944 0.345
review_scores_cleanliness 5.910 11.543 0.512 0.609
review_scores_checkin -12.909 15.368 -0.840 0.401
review_scores_communication -4.704 14.054 -0.335 0.738
review_scores_location 5.775 12.125 0.476 0.634
review_scores_value -38.161 13.996 -2.727 0.006
calculated_host_listings_count -0.006 0.135 -0.044 0.965
reviews_per_month -0.062 2.339 -0.026 0.979
host_tenure 0.107 0.084 1.284 0.199
since_first_review 0.378 0.132 2.870 0.004
since_last_review -0.493 0.391 -1.263 0.207
num_bath 126.016 3.093 40.739 0.000
bath_typeshared -22.623 12.523 -1.807 0.071

The Full Regression Model Metrics

model rmse mae rsq
reg_all_pred 123.485 66.392 0.448
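
The RMSE, MAE, and R-square above follow the standard definitions, computed from actual and predicted prices. A minimal sketch (Python; a hypothetical helper, not the code used in the analysis):

```python
def reg_metrics(actual, predicted):
    """Return (rmse, mae, rsq) for a regression model's predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = (sum(e * e for e in errors) / n) ** 0.5
    mae = sum(abs(e) for e in errors) / n
    sse = sum(e * e for e in errors)
    mean_a = sum(actual) / n
    sst = sum((a - mean_a) ** 2 for a in actual)
    rsq = 1 - sse / sst  # share of variance explained
    return rmse, mae, rsq
```

Note that RMSE squares each error, so the large price outliers penalize RMSE much more heavily than MAE, which is consistent with the gap between the two metrics in the table.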

Actual vs Predicted Graph

Predicting Value Score Category

Here is the initial logistic regression model predicting value score category using all predictors.

term estimate std.error statistic p.value
(Intercept) 1416.850 1357.460 1.044 0.297
host_in_denveryes 0.510 0.224 2.281 0.023
host_response_timewithin a day 2.111 3.480 0.607 0.544
host_response_timewithin a few hours 2.247 3.512 0.640 0.522
host_response_timewithin an hour 2.274 3.515 0.647 0.518
host_response_rate -0.019 0.019 -0.981 0.327
host_acceptance_rate -0.005 0.006 -0.920 0.357
host_is_superhostyes 0.456 0.164 2.786 0.005
neighborhoodAuraria -4.078 6.122 -0.666 0.505
neighborhoodBaker 0.384 0.935 0.410 0.681
neighborhoodBarnum -1.911 1.285 -1.487 0.137
neighborhoodBarnum West 2.187 1.484 1.474 0.140
neighborhoodBear Valley 16.619 1872.800 0.009 0.993
neighborhoodBelcaro 0.048 1.637 0.029 0.977
neighborhoodBerkeley 0.127 1.482 0.086 0.932
neighborhoodCapitol Hill -0.803 1.120 -0.717 0.473
neighborhoodCBD 0.085 1.109 0.077 0.939
neighborhoodChaffee Park -2.151 1.746 -1.232 0.218
neighborhoodCheesman Park 0.601 1.297 0.463 0.643
neighborhoodCherry Creek -0.155 1.314 -0.118 0.906
neighborhoodCity Park -0.581 1.379 -0.421 0.673
neighborhoodCity Park West -0.876 1.220 -0.718 0.473
neighborhoodCivic Center -0.589 1.333 -0.442 0.658
neighborhoodClayton -1.927 1.610 -1.197 0.231
neighborhoodCole -1.342 1.484 -0.904 0.366
neighborhoodCollege View - South Platte 16.104 1295.522 0.012 0.990
neighborhoodCongress Park -0.952 1.236 -0.770 0.441
neighborhoodCory - Merrill 0.346 1.610 0.215 0.830
neighborhoodCountry Club -1.671 1.710 -0.977 0.328
neighborhoodDIA -7.320 3.622 -2.021 0.043
neighborhoodEast Colfax -4.387 1.791 -2.449 0.014
neighborhoodElyria Swansea -1.774 1.996 -0.889 0.374
neighborhoodFive Points -1.334 1.249 -1.068 0.286
neighborhoodFort Logan 18.142 1600.953 0.011 0.991
neighborhoodGateway - Green Valley Ranch -8.075 3.264 -2.474 0.013
neighborhoodGlobeville -1.721 1.833 -0.939 0.348
neighborhoodGoldsmith -0.905 1.673 -0.541 0.588
neighborhoodHale -1.124 1.474 -0.763 0.446
neighborhoodHampden -0.871 2.047 -0.425 0.671
neighborhoodHampden South 0.437 1.938 0.225 0.822
neighborhoodHarvey Park 0.971 2.218 0.438 0.662
neighborhoodHarvey Park South 16.568 1366.932 0.012 0.990
neighborhoodHighland -0.286 1.278 -0.223 0.823
neighborhoodHilltop -2.359 1.818 -1.298 0.194
neighborhoodIndian Creek -4.907 3.018 -1.626 0.104
neighborhoodJefferson Park -0.270 1.200 -0.225 0.822
neighborhoodLincoln Park 0.349 1.193 0.293 0.770
neighborhoodLowry Field -1.550 1.892 -0.819 0.413
neighborhoodMar Lee 1.772 1.290 1.374 0.169
neighborhoodMarston -11.370 3956.181 -0.003 0.998
neighborhoodMontbello -7.997 3.335 -2.398 0.016
neighborhoodMontclair -1.634 1.804 -0.906 0.365
neighborhoodNorth Capitol Hill 0.231 1.254 0.184 0.854
neighborhoodNorth Park Hill -2.812 1.633 -1.722 0.085
neighborhoodNortheast Park Hill -2.368 1.790 -1.323 0.186
neighborhoodOverland 1.196 1.100 1.087 0.277
neighborhoodPlatt Park 1.047 1.064 0.984 0.325
neighborhoodRegis -0.459 1.830 -0.251 0.802
neighborhoodRosedale 1.960 1.336 1.467 0.142
neighborhoodRuby Hill 2.298 1.165 1.973 0.048
neighborhoodSkyland -1.831 1.478 -1.238 0.216
neighborhoodSloan Lake 0.908 1.210 0.750 0.453
neighborhoodSouth Park Hill -3.842 1.635 -2.349 0.019
neighborhoodSpeer 0.397 0.925 0.429 0.668
neighborhoodStapleton -3.920 2.213 -1.772 0.076
neighborhoodSun Valley 10.727 3956.181 0.003 0.998
neighborhoodSunnyside -1.453 1.457 -0.997 0.319
neighborhoodUnion Station -0.420 1.169 -0.359 0.719
neighborhoodUniversity -0.688 1.240 -0.555 0.579
neighborhoodUniversity Hills -0.518 1.601 -0.324 0.746
neighborhoodUniversity Park 0.924 1.316 0.702 0.483
neighborhoodValverde -0.028 1.594 -0.018 0.986
neighborhoodVilla Park 0.524 1.170 0.448 0.654
neighborhoodVirginia Village -1.397 1.414 -0.988 0.323
neighborhoodWashington Park -0.656 1.235 -0.531 0.595
neighborhoodWashington Park West -0.169 0.951 -0.178 0.859
neighborhoodWashington Virginia Vale -1.448 1.488 -0.973 0.330
neighborhoodWellshire 14.672 1409.739 0.010 0.992
neighborhoodWest Colfax 0.583 1.062 0.549 0.583
neighborhoodWest Highland 0.468 1.335 0.350 0.726
neighborhoodWestwood -0.362 1.118 -0.324 0.746
neighborhoodWhittier -1.141 1.408 -0.810 0.418
neighborhoodWindsor -3.385 2.818 -1.202 0.230
latitude 25.193 16.280 1.547 0.122
longitude 22.499 11.474 1.961 0.050
room_typeHotel room 4.478 811.739 0.006 0.996
room_typePrivate room -0.510 0.297 -1.718 0.086
room_typeShared room -1.304 1.403 -0.930 0.352
price 0.000 0.001 0.270 0.787
min_nights 0.003 0.003 0.831 0.406
max_nights 0.000 0.000 1.673 0.094
number_of_reviews 0.134 0.023 5.934 0.000
number_of_reviews_ltm 0.018 0.028 0.643 0.520
review_scores_rating -4.533 0.688 -6.586 0.000
review_scores_accuracy -2.686 0.612 -4.386 0.000
review_scores_cleanliness -0.747 0.398 -1.876 0.061
review_scores_checkin 0.189 0.496 0.381 0.704
review_scores_communication -0.986 0.520 -1.895 0.058
review_scores_location -2.817 0.474 -5.946 0.000
calculated_host_listings_count 0.001 0.004 0.185 0.853
reviews_per_month -0.054 0.089 -0.612 0.541
host_tenure 0.000 0.003 -0.144 0.886
since_first_review 0.019 0.007 2.568 0.010
since_last_review -0.047 0.011 -4.195 0.000
num_bath -0.007 0.113 -0.064 0.949
bath_typeshared 0.267 0.414 0.644 0.519

The Full Classification Model Metrics

model accuracy sensitivity specificity auc
log_all_pred 0.912 0.776 0.946 0.96

ROC Curve

Lasso-P

Column

Predicting Price

Here is the lasso regression model, which narrowed down the number of predictors used to predict price.
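
Lasso regression shrinks coefficients toward zero and drops weak predictors entirely, which is how the variable set was narrowed. As an illustration only (the document does not say which R package was used), here is a bare-bones coordinate-descent lasso with soft-thresholding, in Python:

```python
def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent lasso for standardized predictors (no intercept).

    X: list of rows; y: response; lam: penalty strength. A coefficient whose
    correlation with the partial residual falls below lam is set exactly to
    zero, which is how lasso performs variable selection.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: leave feature j's contribution out
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # soft-thresholding step
            if rho > lam:
                beta[j] = (rho - lam) / z
            elif rho < -lam:
                beta[j] = (rho + lam) / z
            else:
                beta[j] = 0.0
    return beta
```

On toy data where only the first feature drives the response, the second coefficient is shrunk exactly to zero, mirroring how the lasso here reduced 105 normalized predictors to the short list below.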


Results of linear regression with lasso-reduced predictors

term estimate std.error statistic p.value
(Intercept) 167.845 2.733 61.404 0.000
host_response_rate -0.830 2.918 -0.284 0.776
longitude -3.719 4.437 -0.838 0.402
min_nights -4.873 3.004 -1.622 0.105
max_nights 4.943 2.735 1.807 0.071
review_scores_rating 6.235 2.849 2.189 0.029
since_first_review 4.348 2.768 1.571 0.116
num_bath 103.019 2.842 36.252 0.000
host_in_denver_yes 4.591 3.049 1.506 0.132
neighborhood_Auraria -1.862 2.793 -0.667 0.505
neighborhood_Belcaro 2.743 5.098 0.538 0.591
neighborhood_Berkeley -3.407 2.965 -1.149 0.251
neighborhood_Capitol.Hill 3.345 2.690 1.243 0.214
neighborhood_CBD 10.796 2.979 3.624 0.000
neighborhood_Cheesman.Park -1.260 2.765 -0.456 0.649
neighborhood_City.Park.West -5.008 2.889 -1.733 0.083
neighborhood_Civic.Center -6.908 2.971 -2.325 0.020
neighborhood_Cole 13.665 2.641 5.174 0.000
neighborhood_Gateway...Green.Valley.Ranch -7.447 3.944 -1.888 0.059
neighborhood_Goldsmith 2.315 2.441 0.949 0.343
neighborhood_Highland 5.981 2.811 2.128 0.033
neighborhood_Indian.Creek NA NA NA NA
neighborhood_Lowry.Field -3.705 3.962 -0.935 0.350
neighborhood_Mar.Lee -3.379 2.896 -1.167 0.243
neighborhood_Overland -2.796 3.095 -0.903 0.366
neighborhood_Stapleton -5.636 2.723 -2.070 0.039
neighborhood_Union.Station 6.707 2.745 2.444 0.015
neighborhood_Washington.Park.West -2.449 2.732 -0.896 0.370
neighborhood_Washington.Virginia.Vale -2.939 2.735 -1.075 0.283
neighborhood_West.Highland 3.030 2.951 1.027 0.305
neighborhood_Whittier 0.641 2.822 0.227 0.820
room_type_Hotel.room -1.168 3.512 -0.333 0.740
room_type_Shared.room -1.315 2.949 -0.446 0.656
bath_type_shared -10.983 2.992 -3.671 0.000

Column

Metrics for Model Including All Lasso Predictors

model rmse mae rsq
reg_all_pred 123.485 66.392 0.448
all_predictor_lasso 126.494 64.499 0.406

Column

Results of linear regression after backward elimination of insignificant predictors

The final equation for the lasso regression is: price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared

term estimate std.error statistic p.value
(Intercept) 168.213 2.725 61.729 0.000
min_nights -6.955 2.748 -2.531 0.011
max_nights 5.629 2.687 2.095 0.036
review_scores_rating 6.200 2.827 2.193 0.028
num_bath 102.279 2.820 36.270 0.000
neighborhood_CBD 11.250 2.839 3.962 0.000
neighborhood_Civic.Center -6.520 2.966 -2.198 0.028
neighborhood_Cole 13.516 2.619 5.160 0.000
neighborhood_Gateway...Green.Valley.Ranch -9.793 2.734 -3.582 0.000
neighborhood_Highland 6.839 2.696 2.536 0.011
neighborhood_Stapleton -6.061 2.532 -2.394 0.017
neighborhood_Union.Station 6.557 2.696 2.432 0.015
bath_type_shared -10.660 2.795 -3.813 0.000

Column

VIP Plot

Column

Model Metrics Comparison

model rmse mae rsq
reg_all_pred 123.485 66.392 0.448
all_predictor_lasso 126.494 64.499 0.406
reg_lasso 125.436 65.756 0.405

Column

Actual vs. Predicted model comparisons

Lasso-VS

Column

Predicting Value Score

The lasso model yielded only significant predictors, so no backward elimination was needed.


Column

Results of logistic regression using lasso predictors

The final equation for the lasso regression is: value_score_cat ~ number_of_reviews_ltm + review_scores_rating + since_first_review

term estimate std.error statistic p.value
(Intercept) 3.908 0.212 18.424 0
number_of_reviews_ltm 3.614 0.305 11.839 0
review_scores_rating -2.292 0.251 -9.144 0
since_first_review 1.290 0.119 10.869 0

Column

Calculated R-square

Measure Value
R-square 0.464

Column

Confusion Matrix

           Truth
Prediction  excellent other
  excellent       128    55
  other            53   680
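
The accuracy, sensitivity, and specificity reported for this model come straight from this confusion matrix, treating ‘excellent’ as the positive class. A minimal Python sketch (a hypothetical helper; the original metrics were computed in R):

```python
def clf_metrics(tp, fn, fp, tn):
    """Metrics from a 2x2 confusion matrix; 'excellent' is the positive class.

    tp: predicted excellent, truly excellent   fn: predicted other, truly excellent
    fp: predicted excellent, truly other       tn: predicted other, truly other
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    return accuracy, sensitivity, specificity

# the matrix above: tp=128, fn=53, fp=55, tn=680
acc, sens, spec = clf_metrics(128, 53, 55, 680)  # rounds to 0.882, 0.707, 0.925
```

These values reproduce the log_lasso_full row of the metrics comparison table.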

Column

Model Metrics Comparison

model accuracy sensitivity specificity auc
log_all_pred 0.912 0.776 0.946 0.96
log_lasso_full 0.882 0.707 0.925 0.92

Column

VIP Plot

Column

ROC Curve

Rand.For.-P

Column

Part 1: Using all predictors

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 5
  trees = 500
  min_n = 20

Engine-Specific Arguments:
  importance = impurity
  max.depth = 8

Computational engine: ranger 

Column

Model Metrics Comparison

model rmse mae rsq
reg_all_pred 123.485 66.392 0.448
all_predictor_lasso 126.494 64.499 0.406
reg_lasso 125.436 65.756 0.405
reg_full_rf 151.885 75.146 0.409

Column

VIP Plot

Column

Part 2: Using Lasso predictors only

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 5
  trees = 1000
  min_n = 10

Engine-Specific Arguments:
  importance = impurity
  max.depth = 8

Computational engine: ranger 

Column

Model Metrics Comparison

model                   rmse     mae    rsq
reg_all_pred         123.485  66.392  0.448
all_predictor_lasso  126.494  64.499  0.406
reg_lasso            125.436  65.756  0.405
reg_full_rf          151.885  75.146  0.409
reg_lasso_rf         135.934  67.762  0.484

Column

VIP Plot

Column

Actual vs. Predicted model comparisons

Rand.For.-VS

Column

Part 1: Using all predictors

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
value_score_cat ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 4
  trees = 500
  min_n = 10

Engine-Specific Arguments:
  importance = impurity
  max.depth = 8

Computational engine: ranger 

Column

Confusion Matrix

           Truth
Prediction  excellent other
  excellent       105    22
  other            76   713

Column

Model Metrics Comparison

model           accuracy  sensitivity  specificity   auc
log_all_pred       0.912        0.776        0.946  0.96
log_lasso_full     0.882        0.707        0.925  0.92
cat_full_rf        0.893        0.580        0.970  0.95

Column

VIP Plot

Column

ROC Curve

Column

Part 2: Using Lasso predictors only

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
value_score_cat ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 2
  trees = 1000
  min_n = 20

Engine-Specific Arguments:
  importance = impurity
  max.depth = 5

Computational engine: ranger 

Column

Confusion Matrix

           Truth
Prediction  excellent other
  excellent       135    41
  other            46   694

Column

Model Metrics Comparison

model           accuracy  sensitivity  specificity   auc
log_all_pred       0.912        0.776        0.946  0.96
log_lasso_full     0.882        0.707        0.925  0.92
cat_full_rf        0.893        0.580        0.970  0.95
cat_lasso_rf       0.905        0.746        0.944  0.96

Column

VIP Plot

Column

ROC Curve

Conclusion

Column

Price Model Metrics Comparison

model                   rmse     mae    rsq
reg_all_pred         123.485  66.392  0.448
all_predictor_lasso  126.494  64.499  0.406
reg_lasso            125.436  65.756  0.405
reg_full_rf          151.885  75.146  0.409
reg_lasso_rf         135.934  67.762  0.484

Column

VIP Plot for Random Forest Model with Lasso Predictors

Column

Actual vs. Predicted Price model comparisons

Column

Linear regression with lasso predictors

term                                       estimate  std.error  statistic  p.value
(Intercept)                                 168.213      2.725     61.729    0.000
min_nights                                   -6.955      2.748     -2.531    0.011
max_nights                                    5.629      2.687      2.095    0.036
review_scores_rating                          6.200      2.827      2.193    0.028
num_bath                                    102.279      2.820     36.270    0.000
neighborhood_CBD                             11.250      2.839      3.962    0.000
neighborhood_Civic.Center                    -6.520      2.966     -2.198    0.028
neighborhood_Cole                            13.516      2.619      5.160    0.000
neighborhood_Gateway...Green.Valley.Ranch    -9.793      2.734     -3.582    0.000
neighborhood_Highland                         6.839      2.696      2.536    0.011
neighborhood_Stapleton                       -6.061      2.532     -2.394    0.017
neighborhood_Union.Station                    6.557      2.696      2.432    0.015
bath_type_shared                            -10.660      2.795     -3.813    0.000

Column

Value Score Model Metrics Comparison

model           accuracy  sensitivity  specificity   auc
log_all_pred       0.912        0.776        0.946  0.96
log_lasso_full     0.882        0.707        0.925  0.92
cat_full_rf        0.893        0.580        0.970  0.95
cat_lasso_rf       0.905        0.746        0.944  0.96

Column

VIP Plot for Random Forest Model with Lasso Predictors

Column

Value Score ROC Curves

Column

Logistic regression with lasso predictors

term                   estimate  std.error  statistic  p.value
(Intercept)               3.908      0.212     18.424        0
number_of_reviews_ltm     3.614      0.305     11.839        0
review_scores_rating     -2.292      0.251     -9.144        0
since_first_review        1.290      0.119     10.869        0

Reflection

Row

Reflection

Most Proud

I am most proud of my data cleaning and variable transformations. I worked hard to make sure that all of the necessary transformations were done in R so that I could import the data straight from the InsideAirbnb.com website at any time. Some of the code is clunky (I used a massive case_when statement to recode the bathroom variable), but it works!

If I Had More Time

If I had another week I would try a log transformation on the price variable. It’s just so skewed. I don’t think it would change the fact that number of bathrooms (property size, really) is the dominant factor for price, but it might allow for more insight about what else matters.

For the categorical variable, I would want to try some individual decision trees using all the predictors. It's interesting that the overall customer rating is negatively correlated with the value score rating. I'm curious whether the other score ratings would be positively or negatively correlated with it.

---
title: "Kabira Project"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll
    source_code: embed
    theme: yeti
---

```{r setup, include=FALSE,warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output

library(GGally) #v2.1.2
library(ggcorrplot) #v0.1.3
library(MASS) #v7.3-54 for Boston data
library(flexdashboard) #v0.5.2
library(plotly) #v4.10.1
library(crosstalk) #v1.2.0
library(tidymodels) 
  #library(dplyr) #v1.0.7 %>%, select(), select_if(), filter(), mutate(), group_by(), 
    #summarize(), tibble()
  #library(ggplot2) #v3.3.5 ggplot()
library(ISLR) #v1.4 Default, Auto dataset
library(themis) #v1.0.0 step_smote
library(parsnip) #v1.0.3 linear_reg(), discrim_regularized(), set_engine(), set_mode(), fit(), predict()
library(yardstick) #v1.1.0 metrics(), roc_auc(), roc_curve(), metric_set(), conf_mat()
library(dplyr) #v1.0.10 %>%, select(), select_if(), filter(), mutate(), group_by(), 
    #summarize(), tibble()
  #library(ggplot2) #v3.4.0 ggplot()
  #library(broom) #v1.0.2 for tidy(), augment(), glance()
  #library(rsample) #v1.1.1 initial_split(), training(), testing()
library(readr) #v2.1.3 read_csv()
library(knitr) #v1.41 kable()
library(stringr)
theme_set(theme_bw()) #sets default ggplot output style

library(boot) #1.3-28.1 boot()
library(discrim) #v1.0.0 discriminant analysis wrapper
library(janitor) #v2.1.0 clean_names()
library(vip) #0.3.2 vip() (variable importance)
library(glmnet) #v4.1-6 for ridge/lasso regression
library(skimr) #v2.1.5

```

```{r load_data}
# read in compressed file
abb1 <- read_csv("listings.csv.gz")

# for this analysis, I did a lot of recoding of the raw file downloaded from the website


# 1. I eliminated the variables that described the property, as well as the property's id and the host's id
# 2. I recoded TRUE/FALSE (lgl type) variables into factor variables (fctr type)
# 3. I recoded variables that had % or $ signs into numeric values (dbl type)
# 4. I recoded variables that had a limited number of choices (for instance the neighborhood's name) into factor variables
# 5. I eliminated date variables (date type) and made new numeric variables that list the number of elapsed months from the date until 2022.12.31 (the date that the data was scraped from the website)
# 6. I eliminated a variable that included both the number and type of baths and put the number of baths into one new numeric variable and the type of baths into a factor variable
# 7. I created a new (factor type) categorical variable for the purposes of doing a model to predict a categorical variable: if the review_scores_value was 5.0 (the highest possible rating), I coded it as 'excellent'; all other values were coded as 'other'. 
# 8. I deleted all records that had null values. This reduced the number of records from 5250 to 3050.

abb1 <- abb1 %>%
  mutate(host_is_superhost = ifelse(host_is_superhost == TRUE, "yes", "no"),
         host_in_denver = ifelse(host_location == "Denver, CO", "yes", "no"),
         neighborhood = neighbourhood_cleansed, max_guests = accommodates, 
         min_nights = minimum_nights_avg_ntm, max_nights = maximum_nights_avg_ntm) 



abb1 <- abb1 %>%
  select(id, name, host_id, host_name, host_since, host_in_denver, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, neighborhood, latitude, longitude, room_type, max_guests, bathrooms_text, bedrooms, beds, price, min_nights, max_nights, number_of_reviews, number_of_reviews_ltm, first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, calculated_host_listings_count, reviews_per_month)

abb1 <- abb1 %>%
  mutate(price = ifelse(price == 'N/A' | price== '$0.00', 0, price)) %>%
  mutate(host_response_rate = ifelse(host_response_rate== 'N/A','0%',host_response_rate)) %>%
  mutate(host_acceptance_rate = ifelse(host_acceptance_rate== 'N/A', '0%', host_acceptance_rate)) %>%
  mutate(id = as.character(id), host_id = as.character(host_id)) %>%
  mutate(host_response_time = ifelse(host_response_time== 'N/A', NA, host_response_time)) %>%
  mutate(neighborhood = ifelse(neighborhood == 'Kennedy' | neighborhood == 'Southmoor Park', NA, neighborhood))

abb1 <- abb1 %>%
  mutate(price = as.numeric(gsub("[\\$,]", "", price)))%>%
  mutate(host_response_rate = as.numeric(gsub("%$", "", host_response_rate)))%>%
  mutate(host_acceptance_rate = as.numeric(gsub("%$", "", host_acceptance_rate))) %>%
  mutate(host_response_time = as.factor(host_response_time)) %>%
  mutate(room_type = as.factor(room_type)) %>%
  mutate(neighborhood = as.factor(neighborhood)) %>%
  mutate(host_in_denver = as.factor(host_in_denver)) %>%
  mutate(host_is_superhost = as.factor(host_is_superhost)) %>%
  mutate(price = ifelse(price > 0 & price < 3000, price, NA)) %>%
  mutate(host_response_rate = ifelse(host_response_rate > 0, host_response_rate, NA)) %>%
  mutate(host_acceptance_rate = ifelse(host_acceptance_rate > 0, host_acceptance_rate, NA)) %>%
  mutate(host_tenure = as.numeric(difftime("2022-12-31", host_since, units = "days"))/(365.25/12),
         since_first_review = as.numeric(difftime("2022-12-31", first_review, units = "days"))/(365.25/12),
         since_last_review = as.numeric(difftime("2022-12-31", last_review, units = "days"))/(365.25/12)) %>%
  na.omit()

abb1 <- abb1 %>%
  mutate(num_bath = case_when(bathrooms_text == 'Shared half-bath' ~ .5,
                              bathrooms_text == '1 bath' ~ 1,
                              bathrooms_text == '1 shared bath' ~ 1,
                              bathrooms_text == '1 private bath' ~ 1,
                              bathrooms_text == '1.5 baths' ~ 1.5,
                              bathrooms_text == '1.5 shared baths' ~ 1.5,
                              bathrooms_text == '2 baths' ~ 2,
                              bathrooms_text == '2 shared baths' ~ 2,
                              bathrooms_text == '2.5 baths' ~ 2.5,
                              bathrooms_text == '2.5 shared baths' ~ 2.5,
                              bathrooms_text == '3 baths' ~ 3,
                              bathrooms_text == '3 shared baths' ~ 3,
                              bathrooms_text == '3.5 baths' ~ 3.5,
                              bathrooms_text == '3.5 shared baths' ~ 3.5,
                              bathrooms_text == '4 baths' ~ 4,
                              bathrooms_text == '4 shared baths' ~ 4,
                              bathrooms_text == '4.5 baths' ~ 4.5,
                              bathrooms_text == '4.5 shared baths' ~ 4.5,
                              bathrooms_text == '5 baths' ~ 5,
                              bathrooms_text == '5 shared baths' ~ 5,
                              bathrooms_text == '5.5 baths' ~ 5.5,
                              bathrooms_text == '5.5 shared baths' ~ 5.5,
                              bathrooms_text == '6 baths' ~ 6,
                              bathrooms_text == '6 shared baths' ~ 6,
                              bathrooms_text == '6.5 baths' ~ 6.5,
                              bathrooms_text == '6.5 shared baths' ~ 6.5,
                              bathrooms_text == '7 baths' ~ 7,
                              bathrooms_text == '7 shared baths' ~ 7,
                              bathrooms_text == '7.5 baths' ~ 7.5,
                              bathrooms_text == '7.5 shared baths' ~ 7.5,
                              bathrooms_text == '8 baths' ~ 8,
                              bathrooms_text == '8 shared baths' ~ 8,
                              bathrooms_text == '8.5 baths' ~ 8.5,
                              bathrooms_text == '8.5 shared baths' ~ 8.5,
                              bathrooms_text == '9 baths' ~ 9,
                              bathrooms_text == '9 shared baths' ~ 9,
                              bathrooms_text == '9.5 baths' ~ 9.5,
                              bathrooms_text == '9.5 shared baths' ~ 9.5)) %>%
  mutate(bath_type = 
           as.factor(case_when(bathrooms_text == 'Shared half-bath' ~ 'shared',
                              bathrooms_text == '1 bath' ~ 'private',
                              bathrooms_text == '1 shared bath' ~ 'shared',
                              bathrooms_text == '1 private bath' ~ 'private',
                              bathrooms_text == '1.5 baths' ~ 'private',
                              bathrooms_text == '1.5 shared baths' ~ 'shared',
                              bathrooms_text == '2 baths' ~ 'private',
                              bathrooms_text == '2 shared baths' ~ 'shared',
                              bathrooms_text == '2.5 baths' ~ 'private',
                              bathrooms_text == '2.5 shared baths' ~ 'shared',
                              bathrooms_text == '3 baths' ~ 'private',
                              bathrooms_text == '3 shared baths' ~ 'shared',
                              bathrooms_text == '3.5 baths' ~ 'private',
                              bathrooms_text == '3.5 shared baths' ~ 'shared',
                              bathrooms_text == '4 baths' ~ 'private',
                              bathrooms_text == '4 shared baths' ~ 'shared',
                              bathrooms_text == '4.5 baths' ~ 'private',
                              bathrooms_text == '4.5 shared baths' ~ 'shared',
                              bathrooms_text == '5 baths' ~ 'private',
                              bathrooms_text == '5 shared baths' ~ 'shared',
                              bathrooms_text == '5.5 baths' ~ 'private',
                              bathrooms_text == '5.5 shared baths' ~ 'shared',
                              bathrooms_text == '6 baths' ~ 'private',
                              bathrooms_text == '6 shared baths' ~ 'shared',
                              bathrooms_text == '6.5 baths' ~ 'private',
                              bathrooms_text == '6.5 shared baths' ~ 'shared',
                              bathrooms_text == '7 baths' ~ 'private',
                              bathrooms_text == '7 shared baths' ~ 'shared',
                              bathrooms_text == '7.5 baths' ~ 'private',
                              bathrooms_text == '7.5 shared baths' ~ 'shared',
                              bathrooms_text == '8 baths' ~ 'private',
                              bathrooms_text == '8 shared baths' ~ 'shared',
                              bathrooms_text == '8.5 baths' ~ 'private',
                              bathrooms_text == '8.5 shared baths' ~ 'shared',
                              bathrooms_text == '9 baths' ~ 'private',
                              bathrooms_text == '9 shared baths' ~ 'shared',
                              bathrooms_text == '9.5 baths' ~ 'private',
                              bathrooms_text == '9.5 shared baths' ~ 'shared')))
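# Aside (sketch): the two case_when blocks above enumerate every
# bathrooms_text level. Assuming every value is either "Shared half-bath"
# or "<n> [shared|private] bath(s)", the same result can be computed
# directly; demonstrated on a small sample vector so abb1 is untouched:
bath_demo <- c("Shared half-bath", "1 private bath", "2.5 baths", "3 shared baths")
num_demo  <- ifelse(grepl("half-bath", bath_demo), 0.5,
                    suppressWarnings(as.numeric(sub("([0-9.]+).*", "\\1", bath_demo))))
type_demo <- ifelse(grepl("shared", tolower(bath_demo)), "shared", "private")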


abb1 <- abb1 %>%
  mutate(value_score_cat = as.factor(if_else(review_scores_value == 5.00, "excellent", "other")))

abb1 <- abb1 %>%
  select(., - bathrooms_text, -id, -name, -host_id, -host_name, -host_since, -first_review, -last_review)


# creating two new datasets, one for continuous predictor and one for categorical predictor

abb_reg <- abb1 %>%
  select(., -value_score_cat)

abb_cat <- abb1 %>%
  select(., -review_scores_value)

```

Exec. Summary {data-orientation=rows}
=======================================================================

Row {data-height=400}
-----------------------------------------------------------------------
### Executive Summary

#### **This project examined two questions:**

1.	What factors influence the price per night of an Airbnb property in Denver, Colorado?
2.	What factors influence a property receiving a perfect score for value?

#### **Data**
This is a dataset from InsideAirbnb.com, scraped on December 30 and 31, 2022, covering the previous 12 months of Airbnb listings for Denver, CO.

The cleaned dataset has 3050 rows and 32 variables. Because the variables differ vastly in scale (some have values in the thousands; others have a maximum value of 5), all of them were normalized; after the categorical variables were expanded into indicator (dummy) variables, there were 105 normalized predictors.

#### **Methodology**
Analyses were done to predict two variables: price (continuous) and value_score_cat (categorical). value_score_cat has two levels: a perfect score of 5.0 was coded as "excellent", and anything else as "other".

The same methodology was followed for each response variable:

* Examine the distributions of the variables to look for relationships
* Conduct simple regression analyses as baselines
* Normalize all variables
* Employ lasso regression to narrow down the number of predictors
* Build one random forest model with all predictors and a second with only the lasso predictors
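As a minimal sketch, the steps above map to tidymodels roughly as follows (illustrative only, on toy data; the object names are mine and the fixed penalty value is a placeholder):

```{r exec_sketch, echo=TRUE, eval=FALSE}
library(tidymodels)  # silent if already attached in setup
toy <- tibble(price = rnorm(20), x1 = rnorm(20), nb = factor(rep(c("a", "b"), 10)))
rec_sketch <- recipe(price ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%   # categorical -> indicator columns
  step_normalize(all_predictors())           # put every predictor on one scale
lasso_sketch <- linear_reg(penalty = 0.1, mixture = 1) %>%  # mixture = 1 is pure lasso
  set_engine("glmnet")
```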


Column {data-width=400 data-height=400}
-----------------------------------------------------------------------
#### **Conclusions**

### *Price*

Number of baths ended up being the most important predictor of price, by a huge margin. Number of baths is a good proxy for property size: more baths suggest a larger property, and it makes sense that a larger property commands a higher price.

Looking at the analyses as a whole, the three most important factors that a host could change are minimum nights, review scores rating, and maximum nights. We see from the regression coefficients that having a lower number of minimum nights, a higher number of maximum nights, and a higher review scores rating all increase price. The recommendations for a host would be to be flexible (allow for both shorter and longer stays) and to work hard to please the customers. 

### *Value Score*

Review scores rating is the most important predictor of value score. 

Interestingly, review scores rating is **negatively** correlated with value score. This would mean that properties with higher overall ratings are rated as having lower value for the price. The other two significant predictors from the lasso regression are the number of reviews in the last 12 months and the number of months since the first review of the property. These are both positively correlated with value score, which means that properties that have been listed on Airbnb longer and have had more customers recently receive better ratings for value.

The takeaway from the negative correlation between review scores rating and value score may be that, even though customers have a great overall experience at an airbnb, they may believe that it was overpriced. 

More research should be done before offering airbnb hosts advice on this matter, however.






Introduction {data-orientation=rows}
=======================================================================

Row {data-height=1250}
-----------------------------------------------------------------------
### The Project

#### The Problem Description

This project uses data from InsideAirbnb.com, scraped on December 30 and 31, 2022, covering the previous 12 months of Airbnb listings for Denver, CO. The website is http://insideairbnb.com/get-the-data. Variables from the downloaded detailed dataset have been transformed in R in order to conduct the analyses.

The distributions of the variables will be examined first in order to look for relationships. Regression analysis will be performed in order to predict a property's per-night price. Following normalization of the variables, lasso regression will be employed to narrow down the number of variables. One random forest model will be run with all predictors and a second one with the lasso predictors.

A classification analysis will be used to predict a customer's rating of the value of the property (whether the property is a good value for the price paid). Because the ratings skew high, two ratings were assigned: 'excellent' for a perfect rating and 'other' for anything less. Lasso regression will again be used to narrow down the number of predictors, followed by logistic regression, and again one random forest model with all predictors and a second with the lasso predictors. Finally, a summary of conclusions will be presented.

#### The Data
The cleaned dataset has 3050 rows and 32 variables.

#### Data Sources
This is a dataset from a website called InsideAirbnb.com.  It is a dataset that was scraped on December 30 and 31, 2022 from Airbnb for the previous 12 months’ listings for Denver, CO.  The website is http://insideairbnb.com/get-the-data.  

### The Data
VARIABLES TO PREDICT WITH

* **host_in_denver**: whether or not the host lives in Denver
* **host_response_time**: how quickly a host responds to messages
* **host_response_rate**: the percentage of requests to which a host has responded
* **host_acceptance_rate**: the percentage of booking requests that a host accepts
* **host_is_superhost**: whether a host has earned the Airbnb distinction “Superhost”
* **neighborhood**: neighborhood name of the property’s location 
* **latitude**: latitude of property listing
* **longitude**: longitude of property listing
* **room_type**: whether the entire home/apartment is being rented or if it is a private room
* **max_guests**: maximum number of guests
* **bedrooms**: number of bedrooms
* **beds**: total number of beds
* **price**: per-night price (in US$) - this is a predictor for the classification analysis
* **min_nights**: minimum number of consecutive nights that the property may be rented
* **max_nights**: maximum number of consecutive nights that the property may be rented
* **number_of_reviews**: total number of reviews that a listing has
* **number_of_reviews_ltm**: number of reviews that a listing has had in the last twelve months
* **review_scores_rating**: overall average rating for the property
* **review_scores_accuracy**: average rating for the accuracy of the property’s description
* **review_scores_cleanliness**: average rating for the cleanliness of the property
* **review_scores_checkin**: average rating for the ease of check-in
* **review_scores_communication**: average rating for the host’s communication
* **review_scores_location**: average rating for the property’s location
* **review_scores_value**: average rating for the renters’ assessments of value of the rental experience for the price
* **calculated_host_listings_count**: number of listings that a host has
* **reviews_per_month**: number of total reviews / the number of months that the property has been listed
* **host_tenure**: number of months that the host has been an Airbnb member
* **since_first_review**: number of months since the first review
* **since_last_review**: number of months since the most recent review
* **num_bath**: number of bathrooms available to guests
* **bath_type**: whether the bathroom(s) are shared or private

VARIABLES WE WANT TO PREDICT

* **price**: per-night price (in US$)
* **value_score_cat**: calculated from review_scores_value; if review_scores_value is 5.0, the property is rated 'excellent', otherwise 'other'

Explorations {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=350}
-------------------------------------

### Data Overview 
From this data we can see that some of the variables have very wide ranges (since_first_review, for instance) and others very narrow ones with some extreme values (min_nights and beds, for instance). Review scores for each category (except for review_scores_value) all have median values of at least 4.9, but the mean value for every one of the review scores is lower than the median value. 

Looking at the average price by max_guests table, we see that - as expected - the mean price tends to increase with the maximum number of guests.  There is one surprise: the max_guests of 15 has a mean price far below the values for 14 and 16 guests. However, the n for 15 max_guests is low so a single low value could skew the mean dramatically.

Column {data-width=450 data-height=1700}
-----------------------------------------------------------------------
### View the Data Summaries
Here are the ranges of values for each variable.
```{r, cache=TRUE}
#View data
summary(abb1)
```

Column {data-width=150 data-height=600}
-----------------------------------------------------------------------
### Average Price by `max_guests` (Maximum number of guests at one time)
```{r, cache=TRUE}

knitr::kable(abb1 %>%
  group_by(max_guests) %>%
  summarize(n=n(), mean(price)),digits=2)
```
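Because the mean is sensitive to a single extreme listing when n is small (as with max_guests of 15 above), a median column is a useful companion. A minimal sketch on toy numbers (`toy_price` is illustrative only; the same `group_by`/`summarize` pattern applies to `abb1`):

```{r median_sketch}
library(dplyr)  # silent if already attached in setup
toy_price <- tibble(max_guests = c(2, 2, 2, 4, 4),
                    price      = c(100, 120, 900, 200, 220))
price_by_guests <- toy_price %>%
  group_by(max_guests) %>%
  summarize(n = n(), mean_price = mean(price), median_price = median(price))
price_by_guests  # one outlier (900) pulls the mean far above the median
```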

Visualizations {data-orientation=rows}
=======================================================================
### Response variables' relationships with predictors

* Unsurprisingly, we see that the price values are very right-skewed. Although the median price is $125, the maximum price is $2614.  Of the continuous predictors, max_guests, beds, bedrooms, and num_bath have the highest correlations with price. That would be logical, since a property that can accommodate more guests would likely command a higher price. 

* We see that the 'excellent' value score category makes up approximately 20% of the total. One interesting finding is that non-superhosts outnumber superhosts in the 'excellent' value score category, even though a much higher percentage of superhosts than of non-superhosts earn an 'excellent' rating.

* After finding collinearity between beds, bedrooms, num_baths, and max_guests, individual regressions were run between price and the four collinear variables in order to select the one with the largest coefficient. Num_bath, having the highest coefficient of 127.8, was retained in the dataset. The other three variables were eliminated.



Row {data-height=550}
-----------------------------------------------------------------------
### Value Score

```{r, cache=TRUE}
ggplot(abb1,aes(x=value_score_cat)) + geom_bar()
```

### Price
```{r, cache=TRUE}
ggplot(abb1, aes(price)) + geom_histogram(bins=15)
```


Row {.tabset data-height=480}
-----------------------------------------------------------------------

###  Price vs Continuous Variables #1
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,host_response_rate, host_acceptance_rate, latitude, longitude, max_guests)))
```

###  Price vs Continuous Variables #2
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1, price, bedrooms, beds, min_nights, max_nights)))
```

###  Price vs Continuous Variables #3
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,number_of_reviews, number_of_reviews_ltm, review_scores_rating, review_scores_accuracy, review_scores_cleanliness)))
```

###  Price vs Continuous Variables #4
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, calculated_host_listings_count)))
```

###  Price vs Continuous Variables #5
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,reviews_per_month, host_tenure, since_first_review, since_last_review, num_bath)))
```

###  Checking Correlations Between Property Size Proxy Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(abb1, max_guests, num_bath, bedrooms, beds), progress=FALSE)
```

###  Checking Regressions Individually Between Price and Property Size Proxy Variables
```{r}
reg_spec <- linear_reg() %>%
             set_engine('lm') %>%
             set_mode('regression') 
reg_max_guests_fit <- reg_spec %>%
                    fit(price ~ max_guests, data = abb_reg)

tidy(reg_max_guests_fit$fit) %>%
  kable(digits=3)
```
```{r}
reg_num_bath_fit <- reg_spec %>%
                    fit(price ~ num_bath, data = abb_reg)

tidy(reg_num_bath_fit$fit) %>%
  kable(digits=3)
```
```{r}
reg_bedrooms_fit <- reg_spec %>%
                    fit(price ~ bedrooms, data = abb_reg)

tidy(reg_bedrooms_fit$fit) %>%
  kable(digits=3)
```
```{r}
reg_beds_fit <- reg_spec %>%
                    fit(price ~ beds, data = abb_reg)

tidy(reg_beds_fit$fit) %>%
  kable(digits=3)

```


### Value Score vs Host Is Superhost
```{r, cache=TRUE}
abb1 %>% group_by(host_is_superhost, value_score_cat) %>%
  summarize(n=n()) %>%
  ggplot(aes(y=n, x=value_score_cat,fill=host_is_superhost)) +
      geom_bar(position="dodge", stat="identity") +
      geom_text(aes(label=n), position=position_dodge(width=0.9), vjust=-0.25) +
      ggtitle("Value Score vs Host Is Superhost") +
      coord_flip() #makes horizontal
```
```{r}
# eliminate collinear variables (beds, bedrooms, max_guests)
abb_reg <- abb_reg %>%
  dplyr::select(., -beds, -bedrooms, -max_guests)

abb_cat <- abb_cat %>%
  dplyr::select(., -beds, -bedrooms, -max_guests)


```

Initial Models {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Baseline Models

Linear and logistic regression models were run as baseline models.  All predictors were used with these initial models.

We can see from the actual-predicted plot for the linear regression that there are a number of price outliers that are likely causing problems with predictions. Given the large number of outliers, the R-square value of .448 is not surprising.

In contrast, the logistic regression seems to be fairly accurate overall, with accuracy of .912 and an AUC of .96. It is achieving this accuracy with a high specificity of .946 and a rather lower sensitivity of .776.



Row {data-height=2000 data-width=700 .tabset .tabset-fade}
-----------------------------------------------------------------------

### Predicting Price

Here is the initial regression model predicting price using all predictors.

```{r}

reg_all_pred_fit <- reg_spec %>%
                    fit(price ~ ., data = abb_reg)

tidy(reg_all_pred_fit$fit) %>%
  kable(digits=3)

my_reg_metrics <- metric_set(yardstick::rmse, 
                             yardstick::mae, 
                             yardstick::rsq)

pred_reg_all_pred_fit <-reg_all_pred_fit %>%
  augment(abb_reg)

curr_reg_metrics <- pred_reg_all_pred_fit %>%
  my_reg_metrics(truth=price, estimate = .pred)

results_abb_reg <-tibble(model = 'reg_all_pred',
                         rmse = curr_reg_metrics[[1,3]],
                         mae = curr_reg_metrics[[2,3]],
                         rsq = curr_reg_metrics[[3,3]])
```

#### The Full Regression Model Metrics
```{r}
results_abb_reg %>%
  kable(digits=3)
```

#### Actual vs Predicted Graph

```{r, cache=TRUE}
pred_reg_all_pred_fit %>%
  ggplot(aes(x=price, y=.pred)) +
  geom_point(col = "#6e0000") +
      geom_abline(col="gold") + 
      ggtitle("Predicted Price vs Actual Price")
```

### Predicting Value Score Category

Here is the initial logistic regression model predicting value score category using all predictors.

```{r}
log_spec <- logistic_reg() %>%
             set_engine('glm') %>%
             set_mode('classification') 
log_all_pred_fit <- log_spec %>%
                    fit(value_score_cat ~ ., data = abb_cat)

tidy(log_all_pred_fit$fit) %>%
  kable(digits=3)


pred_log_all_pred <- log_all_pred_fit %>%
                    augment(abb_cat)

my_abb_cat_metrics <- metric_set(yardstick::accuracy,yardstick::sensitivity,
                               yardstick::specificity)
abb_cat_curr_metrics <- pred_log_all_pred %>% 
                    my_abb_cat_metrics(truth = value_score_cat, estimate = .pred_class)
curr_auc <- pred_log_all_pred %>%
                  roc_auc(truth = value_score_cat, estimate = .pred_excellent) %>%
                  pull(.estimate)

results_abb_cat <- tibble(model = 'log_all_pred',
                  accuracy = abb_cat_curr_metrics[[1,3]], 
                  sensitivity = abb_cat_curr_metrics[[2,3]],
                  specificity = abb_cat_curr_metrics[[3,3]],
                  auc = round(curr_auc,2))
```

#### The Full Classification Model Metrics

```{r}
results_abb_cat %>%
  kable(digits = 3)
```

```{r}
# ROC

#Capture the thresholds and sens/spec
abb_cat_roc <- pred_log_all_pred %>% 
  roc_curve(truth = value_score_cat, estimate=.pred_excellent) %>% 
                mutate(model = paste('log_all_pred',round(curr_auc,2)))
```

#### ROC Curve

```{r, cache=TRUE}

#Plot the ROC Curve(s) 
ggplot(abb_cat_roc, 
        aes(x = 1 - specificity, y = sensitivity, 
            group = model, col = model)) +
        geom_path() +
        geom_abline(lty = 3)  +
        scale_color_brewer(palette = "Dark2") +
        theme(legend.position = "top") 

```


Lasso-P {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
----------------------------------------------------------------------

### Lasso Model for Continuous Price Variable

Because of the large number of predictors in the dataset, lasso regression was used to narrow down the number of predictors for both the continuous variable (price) and the categorical variable (value_score_cat).

-----------------------------------------------------------------------

#### Price Predictors

After the lasso regression, the number of predictors was reduced to 33. When all of these predictors were used in a linear regression model, many of them proved statistically insignificant, so another 21 variables were eliminated through a backward elimination process. The final regression coefficients appear below.



We can see from the VIP plot that num_bath is more than 5 times more important than the next largest factor (neighborhood_Cole). Number of bathrooms is a good proxy for the size of the property, and it makes sense that larger properties could command higher prices.


The metric comparison table shows that, both before and after backward elimination of the insignificant predictors, the lasso regression has a lower R-square (.406, then .405) than the original linear regression model (.448). The lasso regressions do have slightly lower mean absolute errors than the base model.


We can see from the graph comparing the predictions of the original regression model and the final lasso model that both models tend to overpredict at lower prices, with the lasso regression exhibiting that trend more strongly.



Column {data-width=400, data-height=1170}
-----------------------------------------------------------------------

### Predicting Price

Here is the lasso regression model which narrowed down the number of predictors used to predict price.
 


```{r}
#continuous variable splitting into training and testing datasets

abb_reg_recipe <- recipe(price ~ ., data = abb_reg) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors()) %>%
  prep()
abb_reg_norm <- bake(abb_reg_recipe, abb_reg)
 

```
```{r}
set.seed(12934)
abb_reg_split_norm <- initial_split(abb_reg_norm, prop = .7)
abb_reg_train_norm <- rsample::training(abb_reg_split_norm)
abb_reg_test_norm <- rsample::testing(abb_reg_split_norm)

```
```{r}
my_reg_metrics = metric_set(yardstick::rmse, 
                            yardstick::mae, 
                            yardstick::rsq)

#folds
abb_reg_grid <- tibble(penalty=seq(.1, 50, len = 500))
abb_reg_fold <- vfold_cv(abb_reg_train_norm, v=5)

```
```{r}
#Define Lasso Model Specifications
abb_reg_lassotune_spec <- linear_reg(penalty = tune(),
                          mixture = 1) %>% 
                        set_engine("glmnet") %>% 
                        set_mode("regression")

#Create the workflow, add the recipe and tune on penalty
abb_reg_lassotune_wf <- workflow() %>%
                        add_model(abb_reg_lassotune_spec) %>% 
                        add_formula(price ~ .)
abb_reg_lassotune_rs <- abb_reg_lassotune_wf %>%
                          tune_grid(resamples = abb_reg_fold, 
                                    grid = abb_reg_grid,
                                    metrics = my_reg_metrics)
lowest_rmse_lasso <- abb_reg_lassotune_rs %>%
                       select_best(metric = "rmse")

abb_reg_final_lasso <- abb_reg_lassotune_wf %>% 
                  finalize_workflow(lowest_rmse_lasso)
abb_reg_final_lasso_fit <- abb_reg_final_lasso %>% 
                       fit(abb_reg_train_norm)

pred_abb_reg_final_lasso_fit <- abb_reg_final_lasso_fit %>% 
                           augment(abb_reg_train_norm)

```
```{r} 
#abb_reg_final_lasso_fit %>%
#  extract_fit_parsnip() %>%
#  tidy() %>%
#  filter(estimate != 0) %>%
#kable()
```
```{r}
# updating normalized datasets with pared-down variables

abb_reg_norm2 <- abb_reg_norm %>%
  dplyr::select(price, host_response_rate, longitude, min_nights, max_nights,
                review_scores_rating, since_first_review, num_bath,
                host_in_denver_yes, neighborhood_Auraria, neighborhood_Belcaro,
                neighborhood_Berkeley, neighborhood_Capitol.Hill,
                neighborhood_CBD, neighborhood_Cheesman.Park,
                neighborhood_City.Park.West, neighborhood_Civic.Center,
                neighborhood_Cole, neighborhood_Gateway...Green.Valley.Ranch,
                neighborhood_Goldsmith, neighborhood_Highland,
                neighborhood_Indian.Creek, neighborhood_Lowry.Field,
                neighborhood_Mar.Lee, neighborhood_Overland,
                neighborhood_Stapleton, neighborhood_Union.Station,
                neighborhood_Washington.Park.West,
                neighborhood_Washington.Virginia.Vale,
                neighborhood_West.Highland, neighborhood_Whittier,
                room_type_Hotel.room, room_type_Shared.room, bath_type_shared)
``` 
```{r}
set.seed(12938)
abb_reg_split_norm2 <- initial_split(abb_reg_norm2, prop = .7)
abb_reg_train_norm2 <- rsample::training(abb_reg_split_norm2)
abb_reg_test_norm2 <- rsample::testing(abb_reg_split_norm2)

```
```{r}

# new folds for lasso dataset

abb_reg_fold2 <- vfold_cv(abb_reg_train_norm2, v=5)
```
```{r}
reg_spec <- linear_reg() %>%
             set_engine('lm') %>%
             set_mode('regression') 

abb_reg_lasso_norm_fit <- reg_spec %>%
                    fit(price ~ ., data = abb_reg_train_norm2)
```



-----------------------------------------------------------------------

#### Results of linear regression with lasso-reduced predictors

```{r, cache=TRUE}

tidy(abb_reg_lasso_norm_fit$fit) %>%
  kable(digits=3)

```
```{r}
curr_reg_metrics <- pred_abb_reg_final_lasso_fit %>%
  my_reg_metrics(truth=price, estimate = .pred)


# adding results to comparison table
results_abb_reg_new <-tibble(model = 'all_predictor_lasso',
                         rmse = curr_reg_metrics[[1,3]],
                         mae = curr_reg_metrics[[2,3]],
                         rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```


Column {data-width=400, data-height=170}
-----------------------------------------------------------------------

#### Metrics for Model Including All Lasso Predictors

```{r, cache=TRUE}

results_abb_reg %>%
  kable(digits=3)
```
```{r}
#backwards elimination of insignificant predictors

#eliminate neighborhood_Indian.Creek and neighborhood_Whittier

#abb_reg_lasso_noicwh_fit <- reg_spec %>%
#                    fit(price ~ host_response_rate + longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + room_type_Hotel.room + room_type_Shared.room + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_noicwh_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate host_response_rate

#abb_reg_lasso_host_fit <- reg_spec %>%
#                    fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + room_type_Hotel.room + room_type_Shared.room + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_host_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate room_type_Hotel.room

#abb_reg_lasso_nohotel_fit <- reg_spec %>%
#                    fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +  room_type_Shared.room + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nohotel_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate room_type_Shared.room

#abb_reg_lasso_noshared_fit <- reg_spec %>%
#                    fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_noshared_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Cheesman.Park

#abb_reg_lasso_nochees_fit <- reg_spec %>%
#                    fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nochees_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Belcaro

#abb_reg_lasso_nobelcaro_fit <- reg_spec %>%
#                    fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nobelcaro_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Auraria

#abb_reg_lasso_noaur_fit <- reg_spec %>%
#                    fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_noaur_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate longitude

#abb_reg_lasso_nolong_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nolong_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Goldsmith

#abb_reg_lasso_nogold_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station +     neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nogold_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Washington.Park.West

#abb_reg_lasso_nowashw_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station  + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nowashw_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Berkeley

#abb_reg_lasso_noberk_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station  + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_noberk_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Overland

#abb_reg_lasso_noover_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Stapleton + neighborhood_Union.Station  + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_noover_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Lowry.Field

#abb_reg_lasso_nolow_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Mar.Lee + neighborhood_Stapleton + neighborhood_Union.Station  + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nolow_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Mar.Lee

#abb_reg_lasso_nomar_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station  + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland +   bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nomar_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Washington.Virginia.Vale

#abb_reg_lasso_nowav_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station  + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nowav_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_Capitol.Hill

#abb_reg_lasso_nocap_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station  + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nocap_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_West.Highland

#abb_reg_lasso_nowhi_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station  + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nowhi_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate host_in_denver_yes

#abb_reg_lasso_noden_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station  + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_noden_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate since_first_review

#abb_reg_lasso_nofirrev_fit <- reg_spec %>%
#                    fit(price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared, data = abb_reg_train_norm2)

#tidy(abb_reg_lasso_nofirrev_fit$fit) %>%
#  kable(digits=3)
```
```{r}
#eliminate neighborhood_City.Park.West


abb_reg_lasso_nocityw_fit <- reg_spec %>%
                    fit(price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD +  neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch +  neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared, data = abb_reg_train_norm2)
```
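
The step-by-step elimination recorded in the chunks above can also be expressed as a loop. The sketch below (a hypothetical helper, not part of the original analysis, and not evaluated here) refits the model and drops the least significant remaining predictor until every p-value clears the chosen threshold:

```{r, eval=FALSE}
# Sketch only: automated backward elimination on a data frame whose
# columns are the response plus candidate numeric predictors.
backward_eliminate <- function(data, response = "price", alpha = 0.05) {
  preds <- setdiff(names(data), response)
  repeat {
    fit   <- lm(reformulate(preds, response), data = data)
    pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # skip intercept
    if (max(pvals) <= alpha) return(fit)                # all significant
    preds <- setdiff(preds, names(which.max(pvals)))    # drop the worst
  }
}
# e.g. backward_eliminate(abb_reg_train_norm2, response = "price")
```

Note that a p-value-driven loop like this may drop variables in a different order than the manual steps above, so the resulting model is not guaranteed to be identical.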



Column {data-width=400, data-height=550}
-----------------------------------------------------------------------
#### Results of linear regression after backward elimination of insignificant predictors

The final equation for the lasso regression is: price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared


```{r, cache=TRUE}
tidy(abb_reg_lasso_nocityw_fit$fit) %>%
  kable(digits=3)
```
```{r}
abb_reg_lasso_final_fit <-abb_reg_lasso_nocityw_fit

pred_abb_reg_lasso_final_fit <- abb_reg_lasso_final_fit %>% 
                           augment(abb_reg_train_norm2)
curr_reg_metrics <- pred_abb_reg_lasso_final_fit %>%
  my_reg_metrics(truth=price, estimate = .pred)


# adding results to comparison table
results_abb_reg_new <-tibble(model = 'reg_lasso',
                         rmse = curr_reg_metrics[[1,3]],
                         mae = curr_reg_metrics[[2,3]],
                         rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```



Column {data-width=400, data-height=550}
-----------------------------------------------------------------------

#### VIP Plot

```{r, cache=TRUE}
vip(abb_reg_lasso_final_fit)
```


Column {data-width=400, data-height=200}
-----------------------------------------------------------------------

#### Model Metrics Comparison

```{r}
results_abb_reg %>%
  kable(digits=3)
```


Column {data-width=400, data-height=650}
--------------------------------------------------------------------------

#### Actual vs. Predicted model comparisons

```{r, cache=TRUE}

reg_pred <- bind_rows(pred_reg_all_pred_fit %>%
  mutate(model = "Simple Regression"),
  pred_abb_reg_lasso_final_fit %>%
  mutate(model = "Lasso Regression"))
reg_pred %>%
  ggplot(aes(x=price, y=.pred, col=model)) +
  geom_point(alpha=.40) +
  xlab("Actual Price") +
  ylab("Predicted Price") +
  xlim(c(0,2625)) +
  geom_abline(col="gold") + 
  ggtitle("Comparing Simple Regression and Lasso Regression Models")
```

Lasso-VS {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
-----------------------------------------------------------------------

### Lasso Model for Categorical Value Score Variable

Because of the large number of predictors in the dataset, lasso regression was used for narrowing down the number of predictors for the categorical variable (value_score_cat). The two categories are Excellent (a perfect score) and Other.

We can see that the lasso regression narrowed the predictors down to three, all of which relate to customer reviews. The value score is a rating that indicates the extent to which the customer believes the price they paid was worth the experience they had.

The most significant predictor was the number of reviews in the last 12 months. Its positive coefficient indicates that a larger number of reviews makes an Excellent score more likely. The second most significant predictor was the review scores rating, the overall rating that customers give. Interestingly, its coefficient is negative, meaning that higher overall ratings make an Excellent value score less likely. The third predictor is the number of months since the property's first review; its positive coefficient shows that the longer a property has been listed on Airbnb, the more likely it is to receive an Excellent value score.

Because the calculated R-square value is .464, these three predictors account for just under half of the variability in the data. Since the remaining predictors were eliminated by the lasso regression, we can infer that there are important factors not captured in the data being measured. The metrics comparison shows that the regular logistic regression model with all predictors included is more accurate overall and has higher sensitivity, specificity, and AUC. Finally, we can clearly see the difference in the models' AUC values on the ROC curve.




Column {data-width=400, data-height=110}
-----------------------------------------------------------------------

### Predicting Value Score

The lasso model yielded only significant predictors, so no backward elimination process was needed. 
 
-----------------------------------------------------------------------

```{r} 
#categorical variable splitting into training and testing datasets

abb_cat_recipe <- recipe(value_score_cat ~ ., data = abb_cat) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors()) %>%
  prep()
abb_cat_norm <- bake(abb_cat_recipe, abb_cat)
 #names(abb_cat_norm)

set.seed(12934)
abb_cat_split_norm <- initial_split(abb_cat_norm, prop = .7, strata=value_score_cat)
abb_cat_train_norm <- rsample::training(abb_cat_split_norm)
abb_cat_test_norm <- rsample::testing(abb_cat_split_norm)

```
```{r}
#metrics
my_abb_cat_metrics <- metric_set(yardstick::accuracy,
                               yardstick::sensitivity,
                               yardstick::specificity)

```
```{r}

#folds
abb_cat_grid <- tibble(penalty=seq(.1, 50, len = 500))
abb_cat_fold <- vfold_cv(abb_cat_train_norm, v=5, strata=value_score_cat)

```
```{r}
#Define Model Specifications
abb_cat_lassotune_spec <- logistic_reg(penalty = tune(),
                          mixture = 1) %>% 
                        set_engine("glmnet") %>% 
                        set_mode("classification")

#Create the workflow, add the recipe and tune on penalty
abb_cat_lassotune_wf <- workflow() %>%
                        add_model(abb_cat_lassotune_spec) %>% 
                        add_formula(value_score_cat ~ .)
abb_cat_lassotune_rs <- abb_cat_lassotune_wf %>%
                          tune_grid(resamples = abb_cat_fold, 
                                    grid = abb_cat_grid)
lowest_roc_auc_lasso <- abb_cat_lassotune_rs %>%
                       select_best(metric = "roc_auc")

abb_cat_final_lasso <- abb_cat_lassotune_wf %>% 
                  finalize_workflow(lowest_roc_auc_lasso)
abb_cat_final_lasso_fit <- abb_cat_final_lasso %>% 
                       fit(abb_cat_train_norm)

pred_abb_cat_final_lasso_fit <- abb_cat_final_lasso_fit %>% 
                           augment(abb_cat_train_norm)
```
```{r}
#print(paste('The lowest ROC_AUC Lasso penalty is',lowest_roc_auc_lasso$penalty))
```
```{r}
#abb_cat_final_lasso_fit %>%
#  extract_fit_parsnip() %>%
#  tidy() %>%
#kable()
```
```{r}
#abb_cat_final_lasso_fit %>%
#  extract_fit_parsnip() %>%
#  tidy() %>%
#  filter(estimate != 0) %>%
#kable()
```
```{r}

#my_abb_cat_metrics <- metric_set(yardstick::accuracy,yardstick::sensitivity,
#                               yardstick::specificity)
#abb_cat_curr_metrics <- pred_abb_cat_final_lasso_fit %>% 
#                    my_abb_cat_metrics(truth = value_score_cat, estimate = .pred_class)
#curr_auc <- pred_abb_cat_final_lasso_fit %>%
#                  roc_auc(truth = value_score_cat, estimate = .pred_excellent) %>%
#                  pull(.estimate)

#results_abb_cat_new <- tibble(model = 'first_lasso',
#                  accuracy = abb_cat_curr_metrics[[1,3]], 
#                  sensitivity = abb_cat_curr_metrics[[2,3]],
#                  specificity = abb_cat_curr_metrics[[3,3]],
#                  auc = round(curr_auc,2))
#results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```
```{r}
# create new normalized data and splits with smaller variable list
abb_cat_norm2 <- abb_cat_norm %>%
  dplyr::select(value_score_cat, number_of_reviews_ltm, review_scores_rating, since_first_review)

set.seed(12938)
abb_cat_split_norm2 <- initial_split(abb_cat_norm2, prop = .7, strata=value_score_cat)
abb_cat_train_norm2 <- rsample::training(abb_cat_split_norm2)
abb_cat_test_norm2 <- rsample::testing(abb_cat_split_norm2)

```
```{r}

# new folds for lasso dataset
abb_cat_fold2 <- vfold_cv(abb_cat_train_norm2, v=5, strata=value_score_cat)

log_spec <- logistic_reg() %>%
             set_engine('glm') %>%
             set_mode('classification') 
log_lasso_full_fit <- log_spec %>%
                    fit(value_score_cat ~ ., data = abb_cat_train_norm2)
#summary(log_lasso_full_fit$fit)
```


Column {data-width=400, data-height=300} 
-----------------------------------------------------------------------

#### Results of logistic regression using lasso predictors

The final equation for the lasso regression is: value_score_cat ~ number_of_reviews_ltm + review_scores_rating + since_first_review

```{r, cache=TRUE}
tidy(log_lasso_full_fit$fit) %>%
  kable(digits=3)
```



Column {data-width=400, data-height=150} 
-----------------------------------------------------------------------

#### Calculated R-square

```{r, cache=TRUE}
# pseudo R-square: 1 - residual deviance / null deviance
dev <- glance(log_lasso_full_fit$fit) %>%
          pull(deviance)
null_dev <-glance(log_lasso_full_fit$fit) %>%
          pull(null.deviance)
lasso_rsq <- tibble(Measure="R-square",
                    Value = 1 - (dev/null_dev))
lasso_rsq %>%
  kable(digits=3)
```
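
The quantity computed above is a deviance-based pseudo R-square; for ungrouped binary data such as this, it coincides with McFadden's measure:

$$R^2 = 1 - \frac{D_{\text{model}}}{D_{\text{null}}}$$

where $D_{\text{model}}$ is the fitted model's residual deviance and $D_{\text{null}}$ is the deviance of the intercept-only model.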



Column {data-width=400, data-height=170}  
-----------------------------------------------------------------------

#### Confusion Matrix

```{r, cache=TRUE}
#confusion matrix
pred_log_lasso_full <- log_lasso_full_fit %>%
                    augment(abb_cat_test_norm)

pred_log_lasso_full %>%
  conf_mat(truth = value_score_cat, estimate = .pred_class)

```
```{r}
#metrics
my_abb_cat_metrics <- metric_set(yardstick::accuracy,yardstick::sensitivity,
                               yardstick::specificity)
abb_cat_curr_metrics <- pred_log_lasso_full %>% 
                    my_abb_cat_metrics(truth = value_score_cat, estimate = .pred_class)
curr_auc <- pred_log_lasso_full %>%
                  roc_auc(truth = value_score_cat, .pred_excellent) %>%
                  pull(.estimate)

results_abb_cat_new <- tibble(model = 'log_lasso_full',
                  accuracy = abb_cat_curr_metrics[[1,3]], 
                  sensitivity = abb_cat_curr_metrics[[2,3]],
                  specificity = abb_cat_curr_metrics[[3,3]],
                  auc = round(curr_auc,2))
results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```



Column {data-width=400, data-height=220}  
-----------------------------------------------------------------------

#### Model Metrics Comparison

```{r}
results_abb_cat %>%
  kable(digits = 3)
```



Column {data-width=400, data-height=520} 
-----------------------------------------------------------------------

#### VIP Plot

```{r, cache=TRUE}
vip(log_lasso_full_fit)
```
```{r}
# ROC

#Capture the thresholds and sens/spec
abb_cat_roc <- bind_rows(abb_cat_roc, 
                         pred_log_lasso_full %>% 
  roc_curve(truth = value_score_cat, .pred_excellent) %>% 
                mutate(model = paste('log_lasso_full',round(curr_auc,2))))
```



Column {data-width=400, data-height=590} 
-----------------------------------------------------------------------

#### ROC Curve

```{r, cache=TRUE}

#Plot the ROC Curve(s) 
ggplot(abb_cat_roc, 
        aes(x = 1 - specificity, y = sensitivity, 
            group = model, col = model)) +
        geom_path() +
        geom_abline(lty = 3)  +
        scale_color_brewer(palette = "Dark2") +
        theme(legend.position = "top") 
```


Rand.For.-P {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
----------------------------------------------------------------------

### Random Forest Model for Price Variable

Two random forest models were run: one with all of the available predictors and one using only the predictors resulting from the lasso regression. For both of the models I tried the following parameters:

* **Number of variables to include (mtry)**: 2,3,4,5
* **Number of trees (trees)**: 500, 1000
* **Minimum number of records per leaf (min_n)**: 5, 10, 15, 20
* **Maximum depth of tree (max.depth)**: 5, 6, 7, 8

We can see that the best model for all predictors had these arguments:

* mtry = 5
* trees = 500
* min_n = 20
* max.depth = 8

It did slightly better than the simple or lasso regressions in terms of R-square, but it had markedly higher rmse and mae values. 

The best model using only the lasso predictors had these arguments:

* mtry = 5
* trees = 1000
* min_n = 10
* max.depth = 8
  
It had the best R-square of all of the models at .484, and its rmse and mae values were much lower than those of the initial random forest model, though still higher than the two regression models.

We see from the two graphs of actual vs. predicted values that the random forest models, particularly the lasso one, do better at predicting the lower-priced properties.

The VIP plots also show the number of bathrooms as the most important predictor by far. This has, unsurprisingly, been the case for all of the models.


Column {data-width=400, data-height=520}
-----------------------------------------------------------------------

### Part 1: Using all predictors


```{r}
#Random Forest for Predicting Price
#Part 1: using all predictors

#use original folds

rf_grid <- expand_grid(mtry = 2:5,
                       trees = c(500, 1000),
                       min_n = c(5,10,15,20),
                       max.depth = c(5,6,7,8))

rf_reg_tune_spec <- rand_forest(mtry = tune(),
                                trees = tune(),
                                min_n = tune()) %>%
                    set_engine("ranger",
                               importance = "impurity",
                               max.depth = tune()) %>%
                    set_mode("regression")

reg_tree_wf <- workflow() %>%
                add_model(rf_reg_tune_spec) %>%
                add_formula(price ~ .)

reg_tree_full_rs <- reg_tree_wf %>%
                tune_grid(resamples = abb_reg_fold,
                          grid = rf_grid)

#finalize workflow
final_reg_tree_full_wf <- reg_tree_wf %>%
  finalize_workflow(select_best(reg_tree_full_rs))
final_reg_tree_full_wf

```
```{r}
#fit the model
set.seed(1996)
final_reg_tree_full_fit<-final_reg_tree_full_wf %>%
  fit(data=abb_reg_train_norm)

#calculate metrics
pred_final_reg_tree_full <- final_reg_tree_full_fit%>%
  augment(abb_reg_test_norm)


```
```{r}
# adding results to comparison table
curr_reg_metrics <- pred_final_reg_tree_full%>%
  my_reg_metrics(truth=price, estimate=.pred)

results_abb_reg_new <-tibble(model = 'reg_full_rf',
                         rmse = curr_reg_metrics[[1,3]],
                         mae = curr_reg_metrics[[2,3]],
                         rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```



Column {data-width=400, data-height=220} 
-----------------------------------------------------------------------

#### Model Metrics Comparison

```{r, cache=TRUE}
results_abb_reg %>%
  kable(digits=3)
```



Column {data-width=400, data-height=520} 
-----------------------------------------------------------------------

#### VIP Plot

```{r, cache=TRUE}
#VIP plot
final_reg_tree_full_fit %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "#6e0000", col = "black"))

```



Column {data-width=400, data-height=520}
-----------------------------------------------------------------------

### Part 2: Using Lasso predictors only


```{r}
#Part 2: using lasso predictors only
# re-use grid, model specification, workflow from Part 1

reg_tree_lasso_rs <- reg_tree_wf%>%
                    tune_grid(resamples = abb_reg_fold2,
                              grid = rf_grid)

#finalize workflow
final_reg_tree_lasso_wf <- reg_tree_wf %>%
  finalize_workflow(select_best(reg_tree_lasso_rs))
final_reg_tree_lasso_wf

```
```{r}
#fit the model
set.seed(1998)
final_reg_tree_lasso_fit<-final_reg_tree_lasso_wf %>%
  fit(data=abb_reg_train_norm2)

#calculate metrics
pred_final_reg_tree_lasso <- final_reg_tree_lasso_fit%>%
  augment(abb_reg_test_norm2)


```
```{r}

# adding results to comparison table
curr_reg_metrics <- pred_final_reg_tree_lasso%>%
  my_reg_metrics(truth=price, estimate=.pred)

results_abb_reg_new <-tibble(model = 'reg_lasso_rf',
                         rmse = curr_reg_metrics[[1,3]],
                         mae = curr_reg_metrics[[2,3]],
                         rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```



Column {data-width=400, data-height=220} 
-----------------------------------------------------------------------

#### Model Metrics Comparison

```{r, cache=TRUE}
results_abb_reg %>%
  kable(digits=3)
```



Column {data-width=400, data-height=520} 
-----------------------------------------------------------------------

#### VIP Plot

```{r, cache=TRUE}
#VIP plot
final_reg_tree_lasso_fit %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "#6e0000", col = "black"))

```


Column {data-width=400, data-height=1300}
--------------------------------------------------------------------------

#### Actual vs. Predicted model comparisons

```{r, cache=TRUE}

reg_pred <- bind_rows(pred_final_reg_tree_full %>%
    mutate(model="Full Random Forest"),
  pred_final_reg_tree_lasso %>%
    mutate(model="Lasso Random Forest"))
reg_pred %>%
  ggplot(aes(x=price, y=.pred, col=model)) +
  geom_point(alpha=.40) +
  xlab("Actual Price") +
  ylab("Predicted Price") +
  xlim(c(0,2625)) +
  geom_abline(col="gold") + 
  ggtitle("Comparing Full and Lasso Random Forest Models")
```


```{r, cache=TRUE}
reg_pred <- bind_rows(pred_reg_all_pred_fit %>%
    mutate(model = "Simple Regression"),
  pred_abb_reg_lasso_final_fit %>%
    mutate(model = "Lasso Regression"),
  pred_final_reg_tree_full %>%
    mutate(model="Full Random Forest"),
  pred_final_reg_tree_lasso %>%
    mutate(model="Lasso Random Forest"))
reg_pred %>%
  ggplot(aes(x=price, y=.pred, col=model)) +
  geom_point(alpha=.40) +
  xlab("Actual Price") +
  ylab("Predicted Price") +
  xlim(c(0,2625)) +
  geom_abline(col="gold") + 
  ggtitle("Comparing Regression and Random Forest Models")
```


Rand.For.-VS {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
-------------------------------------

### Random Forest Model for Value Score Variable

Two random forest models were run: one with all of the available predictors and one using only the predictors resulting from the lasso regression. For the model using all predictors I tried the following parameters:

* **Number of variables to include (mtry)**: 2,3,4,5
* **Number of trees (trees)**: 500, 1000
* **Minimum number of records per leaf (min_n)**: 5, 10, 15, 20
* **Maximum depth of tree (max.depth)**: 5, 6, 7, 8

Because only three variables remained after the lasso regression, the second tree model had this change:

* **Number of variables to include**: 2,3

We can see that the best model for all predictors had these arguments:

* mtry = 4
* trees = 500
* min_n = 10
* max.depth = 8

It did better than both the full and lasso logistic regressions in terms of specificity, and better than the lasso logistic regression on accuracy and auc, but it did the worst, by far, on sensitivity. 


The best model using only the lasso predictors had these arguments:

* mtry = 2
* trees = 1000
* min_n = 20
* max.depth = 5

With only three predictors, it is not surprising that the best max.depth is smaller and the best min_n is larger.
  
With an auc of .96, the random forest model with lasso predictors tied the full logistic regression model (the one with all predictors) for auc and came close on the other metrics. The ROC curves show these comparisons nicely.

The important predictors for both models are all related to review scores, and the most important predictor according to both models is review_scores_rating, which is the overall rating that a customer gives. For the model with all predictors, the number of reviews is a close second.

 



Column {data-width=400, data-height=520}
-----------------------------------------------------------------------

### Part 1: Using all predictors


```{r} 
#Random Forest for Predicting Value Score Category
#Part 1: using all predictors

#use original folds and random forest grid previously defined

rf_cat_tune_spec <- rand_forest(mtry = tune(),
                                trees = tune(),
                                min_n=tune()) %>%
                    set_engine("ranger",
                               importance = "impurity",
                               max.depth = tune()) %>%
                    set_mode("classification")

cat_tree_wf <- workflow() %>%
                add_model(rf_cat_tune_spec) %>%
                add_formula(value_score_cat ~ .)

cat_tree_full_rs <- cat_tree_wf %>%
                tune_grid(resamples = abb_cat_fold,
                          grid = rf_grid)
#finalize workflow
final_cat_tree_full_wf <- cat_tree_wf %>%
  finalize_workflow(select_best(cat_tree_full_rs))
final_cat_tree_full_wf
```
```{r}
#fit the model
set.seed(1996)
final_cat_tree_full_fit<-final_cat_tree_full_wf %>%
  fit(data=abb_cat_train_norm)

#calculate metrics
pred_final_cat_tree_full <- final_cat_tree_full_fit%>%
  augment(abb_cat_test_norm)


```



Column {data-width=400, data-height=170}  
-----------------------------------------------------------------------

#### Confusion Matrix

```{r, cache=TRUE}
#confusion matrix

pred_final_cat_tree_full %>%
  conf_mat(truth = value_score_cat, estimate = .pred_class)

```
```{r}

# adding results to comparison table
abb_cat_curr_metrics <- pred_final_cat_tree_full%>%
  my_abb_cat_metrics(truth=value_score_cat, estimate=.pred_class)

curr_auc <- pred_final_cat_tree_full %>%
                  roc_auc(truth = value_score_cat, .pred_excellent) %>%
                  pull(.estimate)

results_abb_cat_new <- tibble(model = 'cat_full_rf',
                  accuracy = abb_cat_curr_metrics[[1,3]], 
                  sensitivity = abb_cat_curr_metrics[[2,3]],
                  specificity = abb_cat_curr_metrics[[3,3]],
                  auc = round(curr_auc,2))
results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```



Column {data-width=400, data-height=220} 
-----------------------------------------------------------------------

#### Model Metrics Comparison

```{r, cache=TRUE}
results_abb_cat %>%
  kable(digits = 3)
```



Column {data-width=400, data-height=520} 
-----------------------------------------------------------------------

#### VIP Plot

```{r, cache=TRUE}

#VIP plot
final_cat_tree_full_fit %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "#6e0000", col = "black"))

```
```{r}
# ROC

#Capture the thresholds and sens/spec
abb_cat_roc <- bind_rows(abb_cat_roc, 
                         pred_final_cat_tree_full %>% 
  roc_curve(truth = value_score_cat, .pred_excellent) %>% 
                mutate(model = paste('rf_tree_full',round(curr_auc,2))))
```



Column {data-width=400, data-height=590} 
-----------------------------------------------------------------------

#### ROC Curve

```{r, cache=TRUE}

#Plot the ROC Curve(s) 
ggplot(abb_cat_roc, 
        aes(x = 1 - specificity, y = sensitivity, 
            group = model, col = model)) +
        geom_path() +
        geom_abline(lty = 3)  +
        scale_color_brewer(palette = "Dark2") +
        theme(legend.position = "top") 

```



Column {data-width=400, data-height=520}
-----------------------------------------------------------------------

### Part 2: Using Lasso predictors only


```{r}
#Part 2: using lasso predictors only
# re-use model specification, workflow from Part 1

#specify new grid b/c only 3 predictors captured by lasso
rf_grid <- expand_grid(mtry = c(2,3),
                       trees = c(500, 1000),
                       min_n = c(5,10,15,20),
                       max.depth = c(5,6,7,8))


cat_tree_lasso_rs <- cat_tree_wf%>%
                    tune_grid(resamples = abb_cat_fold2,
                              grid = rf_grid)
#finalize workflow
final_cat_tree_lasso_wf <- cat_tree_wf %>%
  finalize_workflow(select_best(cat_tree_lasso_rs))
final_cat_tree_lasso_wf

```
```{r}
#fit the model
set.seed(1998)
final_cat_tree_lasso_fit<-final_cat_tree_lasso_wf %>%
  fit(data=abb_cat_train_norm2)

#calculate metrics
pred_final_cat_tree_lasso <- final_cat_tree_lasso_fit%>%
  augment(abb_cat_test_norm2)

```



Column {data-width=400, data-height=170}  
-----------------------------------------------------------------------

#### Confusion Matrix

```{r, cache=TRUE}
#confusion matrix

pred_final_cat_tree_lasso %>%
  conf_mat(truth = value_score_cat, estimate = .pred_class)

```
```{r}

# adding results to comparison table
abb_cat_curr_metrics <- pred_final_cat_tree_lasso%>%
  my_abb_cat_metrics(truth=value_score_cat, estimate=.pred_class)

curr_auc <- pred_final_cat_tree_lasso %>%
                  roc_auc(truth = value_score_cat, .pred_excellent) %>%
                  pull(.estimate)

results_abb_cat_new <- tibble(model = 'cat_lasso_rf',
                  accuracy = abb_cat_curr_metrics[[1,3]], 
                  sensitivity = abb_cat_curr_metrics[[2,3]],
                  specificity = abb_cat_curr_metrics[[3,3]],
                  auc = round(curr_auc,2))
results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```



Column {data-width=400, data-height=220} 
-----------------------------------------------------------------------

#### Model Metrics Comparison

```{r, cache=TRUE}

results_abb_cat %>%
  kable(digits = 3)
```



Column {data-width=400, data-height=520} 
-----------------------------------------------------------------------

#### VIP Plot

```{r, cache=TRUE}

#VIP plot
final_cat_tree_lasso_fit %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "#6e0000", col = "black"))

```
```{r}
# ROC

#Capture the thresholds and sens/spec
abb_cat_roc <- bind_rows(abb_cat_roc, 
                         pred_final_cat_tree_lasso %>% 
  roc_curve(truth = value_score_cat, .pred_excellent) %>% 
                mutate(model = paste('rf_tree_lasso',round(curr_auc,2))))
```



Column {data-width=400, data-height=590} 
-----------------------------------------------------------------------

#### ROC Curve

```{r, cache=TRUE}

#Plot the ROC Curve(s) 
ggplot(abb_cat_roc, 
        aes(x = 1 - specificity, y = sensitivity, 
            group = model, col = model)) +
        geom_path() +
        geom_abline(lty = 3)  +
        scale_color_brewer(palette = "Dark2") +
        theme(legend.position = "top") 
```


Conclusion {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
-----------------------------------------------------------------------

### Project Conclusion

**Price**
Number of baths is the most important predictor of price, by a huge margin. Number of baths is a good proxy for property size: more baths suggest a larger property, and it makes sense that a larger property commands a higher price. 

If we want to find out what can be done to charge a higher price for a property, though, we need to look to the VIP plot (for relative importance) and the regression coefficients (to see if they are positive or negative, in this case). The chosen VIP plot is from the best-performing model overall; the coefficients are from the best regression model. Though the two models don't agree on the ranking, we can still gain insight.

The three most important factors that a host could change are minimum nights, review scores rating, and maximum nights. We see from the regression coefficients that having a lower number of minimum nights, a higher number of maximum nights, and a higher review scores rating all increase price. The takeaway for a host is to be flexible (allow for both shorter and longer stays) and to work hard to please the customers. 

Regarding the models themselves, we see from the metrics comparison and the actual vs predicted plot that the random forest model using the lasso predictors does the best at explaining the data, especially at the lower end of the price scale. The wide range of prices makes it difficult to predict the ones at the higher end of the scale. This is likely why none of the models were able to explain even half of the variation in the data (the highest R-square was .484).


--------------------------------------------------------------------

**Value Score**
Review scores rating is the most important predictor of value score, by far, according to the random forest model that uses only lasso predictors. We can see that this model has the highest auc value and the second-best accuracy, sensitivity, and specificity values. I chose this model because it is simpler than the logistic regression model with all predictors (and because only 22 of the 105 normalized predictors from that model had p-values < .1). 

The coefficients from the logistic regression with lasso predictors give us the additional information that review scores rating is **negatively** correlated with value score. This would mean that higher-rated properties overall are rated as having lower value for the price. The other two significant predictors from the lasso regression are the number of reviews in the last 12 months and the number of months since the first review of the property. Both are positively correlated with value score, which means that properties that have been airbnbs longer, and that have had more customers recently, receive better ratings for value.

The takeaway from the negative correlation between review scores rating and value score may be that, even though customers have a great overall experience at an airbnb, they may believe that it was overpriced. 

More research should be done before offering airbnb hosts advice on this matter, however.



Column {data-width=400, data-height=220} 
-----------------------------------------------------------------------

#### Price Model Metrics Comparison

```{r, cache=TRUE}
results_abb_reg %>%
  kable(digits=3)
```



Column {data-width=400, data-height=520} 
-----------------------------------------------------------------------

#### VIP Plot for Random Forest Model with Lasso Predictors

```{r, cache=TRUE}
#VIP plot
final_reg_tree_lasso_fit %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "#6e0000", col = "black"))

```


Column {data-width=400, data-height=650}
--------------------------------------------------------------------------

#### Actual vs. Predicted Price model comparisons

```{r, cache=TRUE}
reg_pred <- bind_rows(pred_reg_all_pred_fit %>%
    mutate(model = "Simple Regression"),
  pred_abb_reg_lasso_final_fit %>%
    mutate(model = "Lasso Regression"),
  pred_final_reg_tree_full %>%
    mutate(model="Full Random Forest"),
  pred_final_reg_tree_lasso %>%
    mutate(model="Lasso Random Forest"))
reg_pred %>%
  ggplot(aes(x=price, y=.pred, col=model)) +
  geom_point(alpha=.40) +
  xlab("Actual Price") +
  ylab("Predicted Price") +
  xlim(c(0,2625)) +
  geom_abline(col="gold") + 
  ggtitle("Comparing Regression and Random Forest Models")
```



Column {data-width=400, data-height=550}
-----------------------------------------------------------------------
#### Linear regression with lasso predictors


```{r, cache=TRUE}
tidy(abb_reg_lasso_nocityw_fit$fit) %>%
  kable(digits=3)
```




Column {data-width=400, data-height=220} 
-----------------------------------------------------------------------

#### Value Score Model Metrics Comparison

```{r, cache=TRUE}

results_abb_cat %>%
  kable(digits = 3)
```



Column {data-width=400, data-height=520} 
-----------------------------------------------------------------------

#### VIP Plot for Random Forest Model with Lasso Predictors

```{r, cache=TRUE}

#VIP plot
final_cat_tree_lasso_fit %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "#6e0000", col = "black"))

```



Column {data-width=400, data-height=590} 
-----------------------------------------------------------------------

#### Value Score ROC Curves

```{r, cache=TRUE}

#Plot the ROC Curve(s) 
ggplot(abb_cat_roc, 
        aes(x = 1 - specificity, y = sensitivity, 
            group = model, col = model)) +
        geom_path() +
        geom_abline(lty = 3)  +
        scale_color_brewer(palette = "Dark2") +
        theme(legend.position = "top") 

```


Column {data-width=400, data-height=300} 
-----------------------------------------------------------------------

#### Logistic regression with lasso predictors

```{r, cache=TRUE}
tidy(log_lasso_full_fit$fit) %>%
  kable(digits=3)
```



Reflection {data-orientation=rows}
=======================================================================

Row {data-height=1250}
-----------------------------------------------------------------------
### Reflection

**Most Proud**

I am most proud of my data cleaning and variable transformations. I worked hard to make sure that all of the necessary transformations were done in R so that I could import the data straight in from the InsideAirbnb.com website at any time. Some of the code is clunky (I used a massive case statement to recode the bathroom variable), but it works! 
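
For what it's worth, that kind of recode can often be collapsed into one small function; here is a minimal base-R sketch, where the input strings and the `recode_baths` name are hypothetical examples modeled on Airbnb-style bathroom text, not the actual code used above:

```r
# Sketch: one compact recode in place of a long case statement.
# Input strings are hypothetical examples of Airbnb-style bathroom text.
recode_baths <- function(x) {
  ifelse(grepl("half", x, ignore.case = TRUE),
         0.5,  # treat any "half-bath" variant as 0.5
         suppressWarnings(as.numeric(sub("^([0-9.]+).*", "\\1", x))))
}

recode_baths(c("1 bath", "1.5 baths", "Half-bath", "2 shared baths"))
# returns 1.0 1.5 0.5 2.0
```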

**If I Had More Time**

If I had another week I would try a log transformation on the price variable.  It's just so skewed. I don't think it would change the fact that number of bathrooms (property size, really) is the dominant factor for price, but it might allow for more insight about what else matters.
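
A minimal sketch of that follow-up, using base R's `log1p()`/`expm1()` pair so a price of zero round-trips safely (the helper names are mine, not from the analysis above):

```r
# Sketch: log-transform price before fitting, back-transform predictions after.
to_log_price   <- function(price) log1p(price)  # log(1 + price), safe at 0
from_log_price <- function(pred)  expm1(pred)   # inverse: exp(pred) - 1

# a skewed handful of prices round-trips exactly
prices <- c(0, 75, 150, 420, 2600)
all.equal(from_log_price(to_log_price(prices)), prices)  # TRUE
```

The earlier workflows could then be refit with the transformed outcome, remembering to back-transform `.pred` before computing rmse and mae on the dollar scale.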

For the categorical variable, I would want to try some regular trees using all the predictors. It's interesting that the overall customer rating is negatively correlated with the value score rating. I'm curious if other score ratings would be positively or negatively correlated to it.
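
A hedged sketch of that next step, following the same tidymodels pattern as the random forest chunks (spec and workflow only; the tuning call is commented out because it assumes the folds defined earlier in this document):

```r
library(parsnip)
library(workflows)

# Sketch: a single tuned classification tree for value_score_cat,
# mirroring the structure of the random forest chunks above
tree_spec <- decision_tree(cost_complexity = tune(),
                           tree_depth = tune()) %>%
             set_engine("rpart") %>%
             set_mode("classification")

tree_wf <- workflow() %>%
             add_model(tree_spec) %>%
             add_formula(value_score_cat ~ .)

# then, as before:
# tree_rs <- tree_wf %>% tune_grid(resamples = abb_cat_fold, grid = 10)
# final_tree_wf <- tree_wf %>% finalize_workflow(select_best(tree_rs))
```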